
Big Data Architecture

Big Data is a big topic, and the multitude of possibilities is overwhelming. Every software vendor comes up with different products and pursues different goals. I would like to bring some structure into this jungle and make it easier to get started.
Werner Dähn, rtdi.io
October 2, 2019
Smart and Big Data Integration

As a software architect, my goal is to solve complicated tasks with simple solutions. The individual components of a solution each have advantages and disadvantages; the art is to combine them in such a way that, in sum, the advantages remain and the disadvantages cancel each other out.

For many SAP users, the first step will be to enable analytics with Big Data, that is, to find interesting information in these huge volumes of data.

But instead of building a completely new infrastructure for the users, I combine the Big Data system with the existing data warehouse.

The data scientist gets the data lake, a storage area in which all the raw data is available, and a powerful tool to go with it for processing this raw data. The result of this work is new key figures that I add to the data warehouse (see the sketch after the list below). This has several advantages:

  • The business user continues to use his usual tools for analysis, only now he has more key figures.
  • The Data Scientist has access to all data, Big Data and ERP data.
  • For IT, the effort is manageable.
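To make this handover concrete, here is a minimal sketch of such a job, assuming Apache Spark (introduced below) as the processing tool and a Hana-based warehouse reachable via JDBC. The paths, column names, credentials, and target table are hypothetical placeholders, not prescriptions:

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName("KeyFigureHandover").getOrCreate()

    # Raw data straight from the data lake; path and schema are assumptions.
    clicks = spark.read.json("/datalake/weblogs/")

    # A hypothetical new key figure: visit counts per customer.
    keyfigure = clicks.groupBy("customer_id").agg(
        F.count("*").alias("visit_count")
    )

    # Append the result to the existing data warehouse via JDBC, so the
    # business user sees the new key figure in his usual tools.
    (keyfigure.write.format("jdbc")
        .option("url", "jdbc:sap://dwhserver:30015")   # placeholder connection
        .option("driver", "com.sap.db.jdbc.Driver")    # Hana JDBC driver
        .option("dbtable", "DWH.VISIT_KEYFIGURES")     # placeholder table
        .option("user", "loader")
        .option("password", "secret")
        .mode("append")
        .save())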

This solution is also attractive in terms of costs versus benefits versus the probability of success: by docking onto the existing landscape, I reduce the project scope and thus minimize the project risk and the implementation costs, while still fully exploiting the potential benefits.

Thus, a Big Data solution consists of only two components: the data lake with the raw data and a server cluster where the data preparation takes place.

Data Lake or SAP Vora

In the past, SAP offered SAP Vora as a data lake, and it sells the Altiscale solution under the name Big Data Services. Basically, however, a data lake is just a large file system. If SAP sales nevertheless proposes Vora, Altiscale, or Data Hub, the price and performance should be scrutinized very critically.

Why not simply start with a local hard disk or the central file server in the first project phase? As long as there is enough space and the costs for the storage are not too high, this is a perfectly legitimate approach. The files can be copied elsewhere at any time and without any problems, so nothing is blocked for the future.
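A brief sketch of why the move is painless, assuming a Spark job (see the next section) reads the raw files: the location of the data lake is just a path, so switching from a local disk to HDFS or S3 later means changing a single string. All paths here are hypothetical:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("DataLakePaths").getOrCreate()

    # Phase one: the "data lake" is simply a directory on a local disk
    # or a central file server.
    raw = spark.read.csv("file:///data/lake/sensors/", header=True)

    # Later, the same code reads from a distributed store; only the path changes:
    # raw = spark.read.csv("hdfs://namenode:8020/lake/sensors/", header=True)
    # raw = spark.read.csv("s3a://company-lake/sensors/", header=True)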

Preparation with Apache Spark

For processing this data, most projects today use the open-source framework Apache Spark. It allows data-processing programs to be written with just a few lines of code and executed in parallel on a server cluster.
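To illustrate the "few lines of code" claim, here is a minimal PySpark sketch; the master URL, input path, and column names are assumptions for this example:

    from pyspark.sql import SparkSession

    # spark://<host>:7077 is the default address of a standalone master.
    spark = (SparkSession.builder
        .master("spark://sparkmaster:7077")
        .appName("FirstAggregation")
        .getOrCreate())

    # Read the raw files from the data lake and aggregate them; Spark
    # distributes the work in parallel across all workers in the cluster.
    raw = spark.read.json("/datalake/machines/")
    result = raw.groupBy("machine_id").avg("temperature")
    result.show()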

There is no reason for me to reinvent the wheel here, especially since such an installation is very simple and can be done in ten minutes: download the package onto a small Linux machine, extract it, and start a master and a first worker via the start-all script.
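For orientation, the steps could look like this (the version number is a placeholder; in a standalone setup, sbin/start-all.sh starts the master plus the workers listed in conf/workers, defaulting to a single local worker):

    # after downloading the prebuilt package from spark.apache.org:
    tar -xzf spark-3.5.0-bin-hadoop3.tgz
    cd spark-3.5.0-bin-hadoop3

    # start the master and a first worker on this machine
    ./sbin/start-all.sh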

Challenge: Algorithm

The technology is manageable with the above approach. Developing the algorithms for the new key figures is the difficult part: how do you extract information from the mass of data that will ultimately be reflected in the company's profit?

This is precisely where the success of a Big Data project is decided. That's why I think it makes sense to invest here, for example in the training of a data scientist.

In upcoming columns, I will answer questions such as: Why use Apache Spark and not an ETL tool? Why is the data lake needed if the data is already in the data warehouse?

Werner Dähn, rtdi.io

Werner Dähn is a Data Integration Specialist and the Managing Director of rtdi.io.


