
What is Big Data?

Many companies are just starting out with Big Data. They have initial ideas. The potential is being explored. SAP also has different approaches to the topic, depending on which department you talk to.
Werner Dähn, rtdi.io
31 October 2019
Smart and Big Data Integration
This text has been automatically translated from German to English.

The biggest hurdle at the beginning is the term Big Data itself. The literal translation, "mass data", unfortunately captures only one aspect. All the normal data in the ERP system and other databases is mass data as well.

In terms of volume, we must therefore speak of quantities that are too large for databases, whether too large in absolute terms or too large in terms of cost versus benefit. Another aspect is how structured the data is.

The ERP system contains 99 percent well-structured data. The remaining one percent is free text, such as a delivery note. With Big Data it is the other extreme: the exciting information sits in the unstructured parts of the data. When and where a photo was taken is interesting, but what the picture shows is infinitely more important.

The way the data is processed changes accordingly. Whereas a database answers queries such as "total sales per month", the examples above suddenly call for image analysis and text analysis.

The most important definition of Big Data, however, is "all data that is not being used today to increase company profits." Creativity is the name of the game here.

One of my past projects involved tracking server utilization in the data center, with the goal of reducing the number of servers. To illustrate the idea, here is an example.

Sales are to be linked with information on how intensively customers have viewed the respective product on the website. Suppose a product is advertised in the media. Does the advertising get noticed?

If so, we should see increased traffic on the associated product pages. Do prospective customers read the product page briefly, are immediately convinced and then buy? Or do they read the technical data very carefully and then not buy after all?
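A minimal sketch of how these questions could be turned into numbers, assuming a hypothetical web log in which each event records the product, the visitor, the seconds spent on the page, and whether the visit ended in a purchase. All field names and sample values are invented for illustration; in a real project the same aggregation would run on a cluster rather than in plain Python.

```python
from collections import defaultdict

# Hypothetical page-view events: (product, visitor, seconds on page, bought?)
# Names and values are invented for illustration only.
page_views = [
    ("drill", "v1", 20, True),
    ("drill", "v2", 300, False),
    ("drill", "v3", 40, True),
    ("saw", "v4", 200, False),
]


def summarize(events):
    """Condense raw page views into hits, average reading time and
    conversion rate per product."""
    totals = defaultdict(lambda: {"hits": 0, "seconds": 0, "buyers": 0})
    for product, _visitor, seconds, bought in events:
        entry = totals[product]
        entry["hits"] += 1
        entry["seconds"] += seconds
        entry["buyers"] += 1 if bought else 0
    return {
        product: {
            "hits": entry["hits"],
            "avg_seconds": entry["seconds"] / entry["hits"],
            "conversion": entry["buyers"] / entry["hits"],
        }
        for product, entry in totals.items()
    }


metrics = summarize(page_views)
```

A short average reading time combined with a high conversion rate would point to visitors who are quickly convinced; a long reading time with a low conversion rate would point to visitors who study the details and then walk away.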

Once you have an idea of which data should be analyzed with Big Data, the question of a promising architecture arises. Especially in the Big Data area, new products are constantly being developed to replace the old ones. A few years ago, MapReduce on Hadoop was the state of the art; then came Apache Spark, which offers better performance and more functionality.

For a long time Apache Hive was the way to go; today it is Parquet files. In such a dynamic environment, I don't want to spend a lot of money on a potentially short-lived solution, and I also want to stay free to switch to something new at any time.

Apache Spark fits this desire for a powerful but at the same time open solution and is therefore used in almost every project worldwide.

Installation is easy, complex transformations are possible with few lines of code, and the software costs nothing. The big cost would lie in building a separate BI system on top of it.

So instead, I add the metrics calculated with Spark to the existing data warehouse and let users perform new analyses with their familiar tools. For a given product, for example, sales can now also be correlated with reading duration and page hits.
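As a rough sketch of such an analysis, assume the warehouse already holds monthly sales for one product and the precomputed web metrics have been loaded next to them. The table contents, keys, and numbers below are hypothetical, and a BI tool would normally do this calculation; the point is only that the join key (here the month) makes the two sources comparable.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long number lists."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two lists of equal length with at least 2 values")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / sqrt(var_x * var_y)


# Hypothetical warehouse data for one product: month -> units sold
sales = {"2019-07": 100, "2019-08": 150, "2019-09": 200}

# Hypothetical precomputed web metrics for the same product
web_metrics = {
    "2019-07": {"hits": 1000, "avg_seconds": 60},
    "2019-08": {"hits": 1500, "avg_seconds": 45},
    "2019-09": {"hits": 2000, "avg_seconds": 30},
    "2019-10": {"hits": 900, "avg_seconds": 50},  # no sales figure yet
}

# Join on the month and keep only months present in both sources
months = sorted(sales.keys() & web_metrics.keys())
units = [sales[m] for m in months]

r_hits = pearson(units, [web_metrics[m]["hits"] for m in months])
r_duration = pearson(units, [web_metrics[m]["avg_seconds"] for m in months])
```

With only a handful of months such coefficients are indicative at best; they show where a closer look is worthwhile, not a proven cause.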

Conclusion and future: Until recently, storing and processing such secondary data was not attractive in terms of price. The volume of data was too large, the information density too low, and the only way to process data effectively was with database-oriented tools.

These arguments no longer apply today. With the Hadoop Distributed File System (HDFS), large file systems can be built from cheap PC components instead of buying an expensive disk array.

Apache Spark can process these large data sets, with associated complex algorithms including statistical methods and machine learning.

And the solution: tools from the data warehouse sector, including those from SAP, have adapted to this situation and offer direct access to Hadoop files or send transformation tasks to a connected Spark cluster. One of these underappreciated gems is the SAP Hana Spark Connector.


Werner Dähn is Data Integration Specialist and Managing Director of rtdi.io.

