Big Data Architecture
As a software architect, my goal is to accomplish complex tasks with simple solutions. Each individual component of a solution has its own advantages and disadvantages; the art is to combine them in such a way that, in sum, the advantages remain and the disadvantages cancel each other out.
For many SAP users, the first step will be to enable analytics with Big Data, that is, to find interesting information in these huge volumes of data.
But instead of building a completely new infrastructure for the users, I combine the Big Data system with the existing data warehouse.
The data scientist gets the data lake, a storage area in which all the raw data is available, plus a powerful tool for processing that raw data. The result of this work is new key figures that I add to the data warehouse (a short sketch follows after the list below). This has several advantages:
- The business user continues to use his usual tools for analysis, only now he has more key figures.
- The data scientist has access to all data, both Big Data and ERP data.
- For IT, the effort is manageable.
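To make this flow concrete, here is a minimal sketch of such a key-figure calculation with Apache Spark (introduced further below). All paths, column names and the staging location are hypothetical; only the pattern matters: read raw data from the data lake, aggregate it into a new key figure, and hand the result over to the existing data warehouse load.

```python
# Minimal sketch (hypothetical paths and columns): derive a new key figure
# from raw data-lake files and stage it for the existing data warehouse load.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("key-figure-sketch").getOrCreate()

# Raw web-shop clickstream files sitting in the data lake (assumed layout)
clicks = spark.read.json("/datalake/raw/clickstream/*.json")

# New key figure: daily product views per material number
daily_views = (
    clicks
    .withColumn("day", F.to_date("timestamp"))
    .groupBy("day", "material")
    .agg(F.count("*").alias("product_views"))
)

# Write the result to a staging area from which the data warehouse is loaded
daily_views.write.mode("overwrite").parquet("/datalake/staging/daily_product_views")
```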
This solution is also attractive when weighing cost against benefit and probability of success: by building on what already exists, I reduce the project scope, and with it the project risk and the implementation cost, while still fully exploiting the potential benefit.
Thus, a Big Data solution consists of only two components: the data lake with the raw data and a server cluster where the data preparation takes place.
Data Lake or SAP Vora
In the past, SAP offered SAP Vora as a data lake, and it sells the Altiscale solution under the name Big Data Services. At its core, however, a data lake is just a large file system. If the SAP sales team nevertheless proposes Vora, Altiscale or DataHub, price and performance should be scrutinized very critically.
Why not simply start with a local hard disk or the central file server in the first project phase? As long as there is enough space and the storage costs stay reasonable, this is a perfectly valid approach. The files can be copied elsewhere at any time and without any problems, so nothing is blocked for the future, as the sketch below illustrates.
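To Spark, which is introduced in the next section, the data lake is just a path. If the raw files later move to a Hadoop cluster or an object store, only that path changes and the processing code stays the same. The paths below are, of course, examples.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("data-lake-path-sketch").getOrCreate()

# First project phase: raw files on a local disk or a mounted file server
raw = spark.read.text("file:///srv/datalake/raw/sensor-logs/")

# Later, after copying the files elsewhere, only the path changes:
# raw = spark.read.text("hdfs://namenode:8020/datalake/raw/sensor-logs/")
# raw = spark.read.text("s3a://company-datalake/raw/sensor-logs/")

print(raw.count())  # quick check that the files are readable
```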
Preparation with Apache Spark
For processing this data, most projects today use the open source framework Apache Spark. It allows data-processing programs to be written in just a few lines of code and executed in parallel on a server cluster.
There is no reason for me to reinvent the wheel here, especially since such an installation is very simple and can be done in ten minutes: download the package onto a small Linux machine, extract it, and start a master and a first worker via the start-all.sh script. A small smoke test against this cluster is sketched below.
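As a quick check that the freshly started standalone cluster really executes work in parallel, a small job can be run against the master; the host name below is a placeholder, and 7077 is the default port of the standalone master.

```python
from pyspark.sql import SparkSession

# Connect to the standalone master started by start-all.sh
# ("sparkhost" is a placeholder for the machine the master runs on)
spark = (
    SparkSession.builder
    .appName("cluster-smoke-test")
    .master("spark://sparkhost:7077")
    .getOrCreate()
)

# A trivial computation distributed across the worker cores:
# the sum of the squares of the numbers 1 to 1,000,000
total = (
    spark.sparkContext
    .parallelize(range(1, 1_000_001), numSlices=8)
    .map(lambda x: x * x)
    .sum()
)
print(total)

spark.stop()
```

Whether run interactively in the pyspark shell or submitted via spark-submit, the job appears in the master's web UI, which confirms that the worker is doing the actual computation.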
Challenge: Algorithm
The technology is manageable with the approach above. Developing the algorithms for the new key figures is the difficult part: how can information be extracted from the mass of data that is ultimately reflected in the company's profit?
This is precisely where the success of a Big Data project is decided. That is why I think it makes sense to invest here, for example in training a data scientist.
In the following columns, I will answer, among others, these questions: Why use Apache Spark and not an ETL tool? Why is the data lake needed at all if the data is already in the data warehouse?