
The technology of a data lake

In a data warehouse, the data is stored in a relational database. That is expensive, and this is exactly the point where products from the Big Data world come in. Parquet, Hive, SAP Vora, and Exasol are the best-known representatives in the SAP environment.
Werner Dähn, rtdi.io
January 9, 2020
Smart and Big Data Integration

In general, I would divide the data storage options into three categories.

Files: The data is stored as plain files and used like tables. These files should carry information about their own structure and should ideally be indexed as well. The Parquet file format is a representative of this category.
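As a small sketch of how self-describing such files are (assuming pandas with the pyarrow engine is installed; the file and column names are invented for illustration):

    import pandas as pd

    # A small table of made-up sales records
    df = pd.DataFrame({
        "order_id": [1, 2, 3],
        "region": ["EMEA", "APJ", "AMER"],
        "revenue": [1200.0, 850.5, 990.0],
    })

    # Parquet stores column names and types alongside the data,
    # so the file carries its own structure information
    df.to_parquet("sales.parquet", engine="pyarrow")

    # Read it back and use it like a table, fetching only two columns
    restored = pd.read_parquet("sales.parquet", columns=["region", "revenue"])
    print(restored.groupby("region").sum())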

Database process: Instead of working directly with the files, an active service sits on top that feels like a database. It takes care of caching frequently used data and can be queried via ODBC/JDBC. A typical representative of this type in the big data world is Apache Hive.
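A sketch of such a query, here using the PyHive client as one possible way in (host, port, and table name are placeholders; a BI tool would go through ODBC/JDBC in the same manner):

    from pyhive import hive

    # The HiveServer2 service, not the files themselves, answers the query
    conn = hive.Connection(host="hive.example.com", port=10000)
    cursor = conn.cursor()

    # Hive exposes the files underneath as ordinary SQL tables
    cursor.execute("SELECT region, SUM(revenue) FROM sales GROUP BY region")
    for region, total in cursor.fetchall():
        print(region, total)

    conn.close()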

In-memory: For maximum performance, all data is held in memory and indexed, which results in something similar to Hana. Exasol and SAP Vora work according to this principle.

The big data world is built entirely on the idea that many small (and therefore inexpensive) servers form one overall system. This lets you scale out virtually without limit, and the hardware costs grow only linearly.

But the more nodes make up the overall system, the more expensive their synchronization becomes. A join of three or even more tables can mean that every node has to fetch the matching intermediate results of the previous join from the other nodes, and the query runs for hours.

This problem is called "reshuffle". And of course, holding the data in memory does not help here: the intermediate results still have to be redistributed over the network.
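With PySpark the effect can be made visible (table and column names are invented): the physical plan of a join contains Exchange operators, and that is exactly the redistribution of intermediate results over the network:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("reshuffle-demo").getOrCreate()
    # Disable the broadcast shortcut so the plan shows the general shuffle case
    spark.conf.set("spark.sql.autoBroadcastJoinThreshold", -1)

    orders = spark.createDataFrame(
        [(1, "EMEA"), (2, "APJ")], ["order_id", "region"])
    items = spark.createDataFrame(
        [(1, 100.0), (1, 50.0), (2, 80.0)], ["order_id", "amount"])

    # Joining forces rows with the same order_id onto the same node;
    # the Exchange steps in the printed plan are the network reshuffle
    joined = orders.join(items, "order_id")
    joined.explain()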

Hana, on the other hand, is a real database. It is extremely fast when searching, the join performance is great, and you get full transactional consistency for reading and writing. All of this requires a lot of synchronization.

However, such a database does not scale without limit. Many projects work around the reshuffle dilemma by storing the data in a layout optimized for certain queries. That in turn reduces flexibility and increases costs, i.e. precisely the points that were supposed to be the advantages of a data lake.

The synchronization effort of transactional consistency is a logical problem. It cannot be solved without relaxing the requirements, for example to "eventual consistency".

This trade-off is known as the CAP theorem. Of the three properties Consistency, Availability, and Partition tolerance, a distributed system can never guarantee all three at once; in the event of a network partition, it has to sacrifice either consistency or availability.

A highly available, distributed system must therefore compromise on data consistency, while a transactional database system must compromise on availability or scalability.
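A toy sketch of what the eventual-consistency compromise means (pure illustration, not a real replication protocol): two replicas accept writes independently during a partition and only converge once they exchange their states again, here with a simple last-writer-wins rule:

    # Each replica keeps (value, timestamp) per key
    replica_a = {}
    replica_b = {}

    def write(replica, key, value, ts):
        replica[key] = (value, ts)

    def merge(target, source):
        # Last-writer-wins: keep the entry with the newer timestamp
        for key, (value, ts) in source.items():
            if key not in target or target[key][1] < ts:
                target[key] = (value, ts)

    # During a network partition, both sides keep accepting writes
    write(replica_a, "stock", 10, ts=1)
    write(replica_b, "stock", 7, ts=2)
    print(replica_a["stock"] != replica_b["stock"])  # True: temporarily inconsistent

    # After the partition heals, the replicas converge
    merge(replica_a, replica_b)
    merge(replica_b, replica_a)
    print(replica_a["stock"] == replica_b["stock"])  # True: eventually consistent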

The data found in a big data environment is raw data that only becomes information through non-SQL transformations, so a big data-based data warehouse with SQL queries makes no sense.
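What such a non-SQL transformation can look like in practice (the event format is invented): raw clickstream lines are parsed, cleaned, and turned into structured records with plain Python before any SQL comes into play:

    import json

    # Raw data as it might land in the data lake: one JSON event per line
    raw_lines = [
        '{"ts": "2020-01-09T10:00:00", "user": "u1", "event": "view", "sku": "A7"}',
        '{"ts": "2020-01-09T10:00:05", "user": "u1", "event": "buy", "sku": "A7"}',
        'not parseable garbage',
    ]

    records = []
    for line in raw_lines:
        try:
            event = json.loads(line)
        except json.JSONDecodeError:
            # Raw data is dirty; deciding to drop bad lines is itself a transformation
            continue
        # Derive information from the raw event, e.g. a purchase flag
        records.append({
            "user": event["user"],
            "sku": event["sku"],
            "bought": event["event"] == "buy",
        })

    print(records)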

The data lake is the playground of the data scientist, who gains easy access to data that was previously deleted or difficult to get at.

The data scientist can handle all the problems that big data technology brings with it: the semantics of the data, slow performance, and finding out what data exists in the first place. Mixing big data and business data? No problem for him.

Coupling Hana with Vora makes little sense from this point of view. Both store the data in memory and allow fast searches, at corresponding cost. Both have warm storage on disk (a Sybase-based database), and both focus on SQL queries. Vora is also no longer on SAP's price list as a stand-alone product.

Parquet files and a database, on the other hand, complement each other perfectly. Parquet files in a data lake cost practically nothing to store, whereas storage space in the database is expensive.

A database like Hana is excellent at joins and complicated SQL queries, while for a compute cluster precisely these operations are the most expensive.

The combination of the two results in fast business intelligence queries and convenient access to all raw data. Both contribute their strengths.
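A sketch of this division of labor (connection details and the target table are placeholders; hdbcli is SAP's Python driver for Hana): the bulky raw data stays in cheap Parquet files, and only the small aggregate goes into the database, where joins and BI queries are fast:

    import pandas as pd
    from hdbcli import dbapi

    # Cheap side: aggregate the raw data straight from the Parquet file
    raw = pd.read_parquet("sales.parquet")
    aggregate = raw.groupby("region", as_index=False)["revenue"].sum()

    # Expensive but fast side: store only the small result in Hana
    conn = dbapi.connect(address="hana.example.com", port=30015,
                         user="BI_USER", password="secret")
    cursor = conn.cursor()
    cursor.executemany(
        "INSERT INTO SALES_BY_REGION (REGION, REVENUE) VALUES (?, ?)",
        list(aggregate.itertuples(index=False, name=None)),
    )
    conn.commit()
    conn.close()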

Werner Dähn, rtdi.io

Werner Dähn is a data integration specialist and Managing Director of rtdi.io.


The event is organized by the E3 magazine of the publishing house B4Bmedia.net AG. The presentations will be accompanied by an exhibition of selected SAP partners. The ticket price includes attendance at all presentations of the Steampunk and BTP Summit 2025, a visit to the exhibition area, participation in the evening event and catering during the official program. The lecture program and the list of exhibitors and sponsors (SAP partners) will be published on this website in due course.