{"id":64047,"date":"2020-01-09T10:00:32","date_gmt":"2020-01-09T09:00:32","guid":{"rendered":"http:\/\/e3mag.com\/?p=64047"},"modified":"2020-02-07T19:35:33","modified_gmt":"2020-02-07T18:35:33","slug":"the-technology-data","status":"publish","type":"post","link":"https:\/\/e3mag.com\/en\/the-technology-data\/","title":{"rendered":"The technology of a data lake"},"content":{"rendered":"<p>In general, I would divide the data storage options into three categories. Files: The data is stored as plain files and used like tables.<\/p>\n<p>These files should carry information about their structure and should also be indexed. The Parquet file format is a representative of this category.<\/p>\n<p>Database process: Instead of working directly with the files, an active service sits on top that feels like a database. It takes care of caching frequently used data and can be queried via ODBC\/JDBC. A typical representative of this type in the big data world is Apache Hive.<\/p>\n<p>In-memory: For maximum performance, all data is held in memory and indexed, building something similar to Hana. Exasol and SAP Vora work according to this principle.<\/p>\n<p>The big data world is built entirely on the idea that many small (and therefore inexpensive) servers form one overall system. This allows near-infinite scaling, and the hardware costs increase only linearly.<\/p>\n<p>But the more nodes make up the overall system, the more expensive their synchronization becomes. Joining three or more tables can mean that each node has to fetch the matching intermediate results of the previous join, and the query runs for hours.<\/p>\n<p>This problem is called \"reshuffle\". Of course, the fact that the data is stored in memory does not help when the intermediate results have to be redistributed over the network.<\/p>\n<p>Hana, on the other hand, is a real database. It is extremely fast when searching, the join performance is excellent, and you have full transactional consistency when reading and writing. All of this requires a lot of synchronization.<\/p>\n<p>However, such a database does not scale infinitely. Many projects solve the \"reshuffle\" dilemma by storing the data in a form optimized for certain queries. This in turn reduces flexibility and increases costs, i.e. 
precisely the points that were actually intended as advantages of a data lake.<\/p>\n<p>The synchronization effort behind transactional consistency is a fundamental problem. It cannot be solved without relaxing the requirements, for example to \"eventual consistency\".<\/p>\n<p>This trade-off is known as the CAP theorem: of the three guarantees consistency, availability and partition tolerance, a distributed system can never provide all three at once; in the event of a failure such as a network partition, it must choose between consistency and availability.<\/p>\n<p>A highly available, distributed system must compromise on data consistency, while a transactional database system must compromise on availability or scalability.<\/p>\n<p>The data available in big data systems is raw data that becomes information only through non-SQL transformations - so a big-data-based data warehouse queried purely with SQL makes no sense.<\/p>\n<p>The data lake is the playground of the data scientist, who gains easy access to data that was previously deleted or difficult to reach.<\/p>\n<p>The data scientist can deal with all the problems that big data technology brings: the semantics of the data, slow performance, and finding out what data exists at all. Mixing big data and business data? No problem for them.<\/p>\n<p>Coupling Hana with Vora makes little sense from this point of view. Both store the data in-memory and allow fast searches - with corresponding costs. Both have warm storage on disk (a Sybase database), and both focus on SQL queries. Vora is also no longer on SAP's price list as a stand-alone product.<\/p>\n<p>Parquet files and a database, on the other hand, complement each other perfectly. The Parquet files in a data lake cost practically nothing to store, whereas storage space in the database is expensive.<\/p>\n<p>A database like Hana is excellent for joins and complicated SQL queries, while for a compute cluster precisely these operations are the most expensive.<\/p>\n<p>Combining the two results in fast business intelligence queries and convenient access to all raw data. 
Both contribute their strengths.<\/p>","protected":false},"excerpt":{"rendered":"<p>In a data warehouse, the data is stored in a relational database. This is expensive and accordingly there are products from the Big Data world that start here. Parquet, Hive, SAP Vora and Exasol are the best-known representatives in the SAP environment.<\/p>","protected":false},"author":1891,"featured_media":62136,"comment_status":"open","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"_acf_changed":false,"pmpro_default_level":"","footnotes":""},"categories":[7,37003,36004],"tags":[210,6062,338],"coauthors":[36006],"class_list":["post-64047","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-meinung","category-mag-1912","category-smart-big-data-integration","tag-big-data","tag-data-scientist","tag-sql","pmpro-has-access"],"acf":[],"featured_image_urls_v2":{"full":["https:\/\/e3mag.com\/wp-content\/uploads\/2019\/08\/Smart-and-Big-Data-Integration.jpg",1000,450,false],"thumbnail":["https:\/\/e3mag.com\/wp-content\/uploads\/2019\/08\/Smart-and-Big-Data-Integration-150x150.jpg",150,150,true],"medium":["https:\/\/e3mag.com\/wp-content\/uploads\/2019\/08\/Smart-and-Big-Data-Integration.jpg",400,180,false],"medium_large":["https:\/\/e3mag.com\/wp-content\/uploads\/2019\/08\/Smart-and-Big-Data-Integration-768x346.jpg",768,346,true],"large":["https:\/\/e3mag.com\/wp-content\/uploads\/2019\/08\/Smart-and-Big-Data-Integration.jpg",1000,450,false],"image-100":["https:\/\/e3mag.com\/wp-content\/uploads\/2019\/08\/Smart-and-Big-Data-Integration-100x45.jpg",100,45,true],"image-480":["https:\/\/e3mag.com\/wp-content\/uploads\/2019\/08\/Smart-and-Big-Data-Integration-480x216.jpg",480,216,true],"image-640":["https:\/\/e3mag.com\/wp-content\/uploads\/2019\/08\/Smart-and-Big-Data-Integration-640x288.jpg",640,288,true],"image-720":["https:\/\/e3mag.com\/wp-content\/uploads\/2019\/08\/Smart-and-Big-Data-Integration-720x324.jpg",72
0,324,true],"image-960":["https:\/\/e3mag.com\/wp-content\/uploads\/2019\/08\/Smart-and-Big-Data-Integration-960x432.jpg",960,432,true],"image-1168":["https:\/\/e3mag.com\/wp-content\/uploads\/2019\/08\/Smart-and-Big-Data-Integration.jpg",1000,450,false],"image-1440":["https:\/\/e3mag.com\/wp-content\/uploads\/2019\/08\/Smart-and-Big-Data-Integration.jpg",1000,450,false],"image-1920":["https:\/\/e3mag.com\/wp-content\/uploads\/2019\/08\/Smart-and-Big-Data-Integration.jpg",1000,450,false],"1536x1536":["https:\/\/e3mag.com\/wp-content\/uploads\/2019\/08\/Smart-and-Big-Data-Integration.jpg",1000,450,false],"2048x2048":["https:\/\/e3mag.com\/wp-content\/uploads\/2019\/08\/Smart-and-Big-Data-Integration.jpg",1000,450,false],"trp-custom-language-flag":["https:\/\/e3mag.com\/wp-content\/uploads\/2019\/08\/Smart-and-Big-Data-Integration.jpg",18,8,false],"bricks_large_16x9":["https:\/\/e3mag.com\/wp-content\/uploads\/2019\/08\/Smart-and-Big-Data-Integration.jpg",1000,450,false],"bricks_large":["https:\/\/e3mag.com\/wp-content\/uploads\/2019\/08\/Smart-and-Big-Data-Integration.jpg",1000,450,false],"bricks_large_square":["https:\/\/e3mag.com\/wp-content\/uploads\/2019\/08\/Smart-and-Big-Data-Integration.jpg",1000,450,false],"bricks_medium":["https:\/\/e3mag.com\/wp-content\/uploads\/2019\/08\/Smart-and-Big-Data-Integration.jpg",600,270,false],"bricks_medium_square":["https:\/\/e3mag.com\/wp-content\/uploads\/2019\/08\/Smart-and-Big-Data-Integration.jpg",600,270,false],"profile_24":["https:\/\/e3mag.com\/wp-content\/uploads\/2019\/08\/Smart-and-Big-Data-Integration-24x24.jpg",24,24,true],"profile_48":["https:\/\/e3mag.com\/wp-content\/uploads\/2019\/08\/Smart-and-Big-Data-Integration-48x48.jpg",48,48,true],"profile_96":["https:\/\/e3mag.com\/wp-content\/uploads\/2019\/08\/Smart-and-Big-Data-Integration-96x96.jpg",96,96,true],"profile_150":["https:\/\/e3mag.com\/wp-content\/uploads\/2019\/08\/Smart-and-Big-Data-Integration-150x150.jpg",150,150,true],"profile_300":["https:\/\/e3m
ag.com\/wp-content\/uploads\/2019\/08\/Smart-and-Big-Data-Integration-300x300.jpg",300,300,true]},"post_excerpt_stackable_v2":"<p>In a data warehouse, the data is stored in a relational database. This is expensive and accordingly there are products from the Big Data world that start here. Parquet, Hive, SAP Vora and Exasol are the best-known representatives in the SAP environment.<\/p>\n","category_list_v2":"<a href=\"https:\/\/e3mag.com\/en\/category\/opinion\/\" rel=\"category tag\">Die Meinung der SAP-Community<\/a>, <a href=\"https:\/\/e3mag.com\/en\/category\/mag-1912\/\" rel=\"category tag\">MAG 19-12<\/a>, <a href=\"https:\/\/e3mag.com\/en\/category\/opinion\/smart-big-data-integration\/\" rel=\"category tag\">Smart &amp; Big Data Integration<\/a>","author_info_v2":{"name":"Werner D\u00e4hn, rtdi.io","url":"https:\/\/e3mag.com\/en\/author\/werner-daehn\/"},"comments_num_v2":"0 comments","_links":{"self":[{"href":"https:\/\/e3mag.com\/en\/wp-json\/wp\/v2\/posts\/64047","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/e3mag.com\/en\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/e3mag.com\/en\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/e3mag.com\/en\/wp-json\/wp\/v2\/users\/1891"}],"replies":[{"embeddable":true,"href":"https:\/\/e3mag.com\/en\/wp-json\/wp\/v2\/comments?post=64047"}],"version-history":[{"count":0,"href":"https:\/\/e3mag.com\/en\/wp-json\/wp\/v2\/posts\/64047\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/e3mag.com\/en\/wp-json\/wp\/v2\/media\/62136"}],"wp:attachment":[{"href":"https:\/\/e3mag.com\/en\/wp-json\/wp\/v2\/media?parent=64047"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/e3mag.com\/en\/wp-json\/wp\/v2\/categories?post=64047"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/e3mag.com\/en\/wp-json\/wp\/v2\/tags?post=64047"},{"taxonomy":"author","embeddable":true,"href":"https:\/\/e3mag.com\/en\/wp-json\/wp\/v2\/coauthors?post=64047"}]
,"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}