
The technology of a data lake

In a data warehouse, the data is stored in a relational database. That is expensive, and this is exactly the point where products from the Big Data world come in. Parquet, Hive, SAP Vora, and Exasol are the best-known representatives in the SAP environment.
Werner Dähn, rtdi.io
January 9, 2020
Smart and Big Data Integration

In general, I would divide the data storage options into three categories.

Files: The data is stored as plain files and used like tables. These files should carry information about their own structure and should ideally be indexed as well. The Parquet file format is a representative of this category.
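As a small sketch of how self-describing such files are (assuming pandas with the pyarrow engine is installed; the file and column names are invented for illustration):

    import pandas as pd

    # A small table of made-up sales records
    df = pd.DataFrame({
        "order_id": [1, 2, 3],
        "region": ["EMEA", "APJ", "AMER"],
        "revenue": [1200.0, 850.5, 990.0],
    })

    # Parquet stores column names and types alongside the data,
    # so the file carries its own structure information
    df.to_parquet("sales.parquet", engine="pyarrow")

    # Read it back and use it like a table, fetching only two columns
    restored = pd.read_parquet("sales.parquet", columns=["region", "revenue"])
    print(restored.groupby("region").sum())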

Database process: Instead of working directly with the files, an active service sits on top that feels like a database. It takes care of caching frequently used data and can be queried via ODBC/JDBC. A typical representative of this type in the big data world is Apache Hive.
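A sketch of such a query, here using the PyHive client as one possible way in (host, port, and table name are placeholders; a BI tool would go through ODBC/JDBC in the same manner):

    from pyhive import hive

    # The HiveServer2 service, not the files themselves, answers the query
    conn = hive.Connection(host="hive.example.com", port=10000)
    cursor = conn.cursor()

    # Hive exposes the files underneath as ordinary SQL tables
    cursor.execute("SELECT region, SUM(revenue) FROM sales GROUP BY region")
    for region, total in cursor.fetchall():
        print(region, total)

    conn.close()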

In-memory: For maximum performance, all data is held in memory and indexed, which results in something similar to Hana. Exasol and SAP Vora work according to this principle.

The big data world is built entirely on the idea that many small (and therefore inexpensive) servers form one overall system. This lets you scale out virtually without limit, and the hardware costs grow only linearly.

But the more nodes make up the overall system, the more expensive their synchronization becomes. A join of three or even more tables can mean that every node has to fetch the matching intermediate results of the previous join from the other nodes, and the query runs for hours.

This problem is called "reshuffle". And of course, holding the data in memory does not help here: the intermediate results still have to be redistributed over the network.
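With PySpark the effect can be made visible (table and column names are invented): the physical plan of a join contains Exchange operators, and that is exactly the redistribution of intermediate results over the network:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("reshuffle-demo").getOrCreate()
    # Disable the broadcast shortcut so the plan shows the general shuffle case
    spark.conf.set("spark.sql.autoBroadcastJoinThreshold", -1)

    orders = spark.createDataFrame(
        [(1, "EMEA"), (2, "APJ")], ["order_id", "region"])
    items = spark.createDataFrame(
        [(1, 100.0), (1, 50.0), (2, 80.0)], ["order_id", "amount"])

    # Joining forces rows with the same order_id onto the same node;
    # the Exchange steps in the printed plan are the network reshuffle
    joined = orders.join(items, "order_id")
    joined.explain()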

Hana, on the other hand, is a real database. It is extremely fast when searching, the join performance is great, and you get full transactional consistency for reading and writing. All of this requires a lot of synchronization.

However, such a database does not scale without limit. Many projects work around the reshuffle dilemma by storing the data in a layout optimized for certain queries. That in turn reduces flexibility and increases costs, i.e. precisely the points that were supposed to be the advantages of a data lake.

The synchronization effort of transactional consistency is a logical problem. It cannot be solved without relaxing the requirements, for example to "eventual consistency".

This trade-off is known as the CAP theorem. Of the three properties Consistency, Availability, and Partition tolerance, a distributed system can never guarantee all three at once; in the event of a network partition, it has to sacrifice either consistency or availability.

A highly available, distributed system must therefore compromise on data consistency, while a transactional database system must compromise on availability or scalability.
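A toy sketch of what the eventual-consistency compromise means (pure illustration, not a real replication protocol): two replicas accept writes independently during a partition and only converge once they exchange their states again, here with a simple last-writer-wins rule:

    # Each replica keeps (value, timestamp) per key
    replica_a = {}
    replica_b = {}

    def write(replica, key, value, ts):
        replica[key] = (value, ts)

    def merge(target, source):
        # Last-writer-wins: keep the entry with the newer timestamp
        for key, (value, ts) in source.items():
            if key not in target or target[key][1] < ts:
                target[key] = (value, ts)

    # During a network partition, both sides keep accepting writes
    write(replica_a, "stock", 10, ts=1)
    write(replica_b, "stock", 7, ts=2)
    print(replica_a["stock"] != replica_b["stock"])  # True: temporarily inconsistent

    # After the partition heals, the replicas converge
    merge(replica_a, replica_b)
    merge(replica_b, replica_a)
    print(replica_a["stock"] == replica_b["stock"])  # True: eventually consistent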

The data found in a big data environment is raw data that only becomes information through non-SQL transformations, so a big data-based data warehouse with SQL queries makes no sense.
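What such a non-SQL transformation can look like in practice (the event format is invented): raw clickstream lines are parsed, cleaned, and turned into structured records with plain Python before any SQL comes into play:

    import json

    # Raw data as it might land in the data lake: one JSON event per line
    raw_lines = [
        '{"ts": "2020-01-09T10:00:00", "user": "u1", "event": "view", "sku": "A7"}',
        '{"ts": "2020-01-09T10:00:05", "user": "u1", "event": "buy", "sku": "A7"}',
        'not parseable garbage',
    ]

    records = []
    for line in raw_lines:
        try:
            event = json.loads(line)
        except json.JSONDecodeError:
            # Raw data is dirty; deciding to drop bad lines is itself a transformation
            continue
        # Derive information from the raw event, e.g. a purchase flag
        records.append({
            "user": event["user"],
            "sku": event["sku"],
            "bought": event["event"] == "buy",
        })

    print(records)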

The data lake is the playground of the data scientist, who gains easy access to data that was previously deleted or difficult to get at.

The data scientist can handle all the problems that big data technology brings with it: the semantics of the data, slow performance, and finding out what data exists in the first place. Mixing big data and business data? No problem for him.

Coupling Hana with Vora makes little sense from this point of view. Both store the data in memory and allow fast searches, at corresponding cost. Both have warm storage on disk (a Sybase-based database), and both focus on SQL queries. Vora is also no longer on SAP's price list as a stand-alone product.

Parquet files and a database, on the other hand, complement each other perfectly. Parquet files in a data lake cost practically nothing to store, whereas storage space in the database is expensive.

A database like Hana is excellent at joins and complicated SQL queries, while for a compute cluster precisely these operations are the most expensive.

The combination of the two results in fast business intelligence queries and convenient access to all raw data. Both contribute their strengths.
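A sketch of this division of labor (connection details and the target table are placeholders; hdbcli is SAP's Python driver for Hana): the bulky raw data stays in cheap Parquet files, and only the small aggregate goes into the database, where joins and BI queries are fast:

    import pandas as pd
    from hdbcli import dbapi

    # Cheap side: aggregate the raw data straight from the Parquet file
    raw = pd.read_parquet("sales.parquet")
    aggregate = raw.groupby("region", as_index=False)["revenue"].sum()

    # Expensive but fast side: store only the small result in Hana
    conn = dbapi.connect(address="hana.example.com", port=30015,
                         user="BI_USER", password="secret")
    cursor = conn.cursor()
    cursor.executemany(
        "INSERT INTO SALES_BY_REGION (REGION, REVENUE) VALUES (?, ?)",
        list(aggregate.itertuples(index=False, name=None)),
    )
    conn.commit()
    conn.close()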

Werner Dähn, rtdi.io

Werner Dähn is a data integration specialist and Managing Director of rtdi.io.


The event is organized by the E3 magazine of the publishing house B4Bmedia.net AG. The presentations will be accompanied by an exhibition of selected SAP partners. The ticket price includes attendance at all presentations of the Steampunk and BTP Summit 2025, a visit to the exhibition area, participation in the evening event and catering during the official program. The lecture program and the list of exhibitors and sponsors (SAP partners) will be published on this website in due course.