
A Guide to SAP Data Integration

Software development takes time, so you have to plan a solution before the first customers even ask for it. When I created Hana Smart Data Integration (SDI), nobody had such a solution on their radar - ETL tools were considered good enough.
Werner Dähn, rtdi.io
September 23, 2020
This text has been automatically translated from German to English.

Today, SDI is used in all Hana solutions and customers take it for granted. With the next evolutionary step, I wanted to bring data and process integration together in a way that hadn't been done before. The technology was finally far enough along to be able to merge the two product categories. However, this clashed with SAP's organizational structure and was not taken up.

Today, the integration issue has escalated all the way to the executive board, and the result is not exactly convincing. Worse still, I see many existing SAP customers solving this integration issue for themselves in a clever way, meaning they are already further along than SAP itself. The open source solution my company provides gives these customers the finishing touch.

In the following, I would like to take you on a journey through the insights these pioneers arrived at and where we have improved on them.

System vs. data integration

Looking at SAP's product portfolio, there is a sharp distinction between process integration and data integration. On the one hand, you want to connect the ERP system with another application; on the other hand, you want to transfer table contents from A to B. Ask an application developer and they will talk about "entities" such as the business partner. In data integration, you go one level deeper, to the tables. This separation does not exist in the Big Data world: there, all products can deal with deeply nested objects, and a database table is just a particularly simple object. This leads to the first conclusion: let's forget about the database tools and take a closer look at the Big Data products.

Batch vs. real-time

Next, there are tools that transfer mass data in batch, and others that are built for real time. This separation has technical reasons, but from a purely logical point of view, real time is a superset of batch. With batch, you can never transfer data at arbitrarily short intervals. With a real-time system, however, batch processing is possible: it simply looks as if nothing happens in the source for hours and then suddenly, within a short interval, a lot of data is generated. For this, however, the real-time tool must be able to handle mass data - which brings us back to the Big Data portfolio.

If you look at the solution from the perspective of which systems are coupled with which, in the past it was mostly a one-to-one relationship. The SAP ERP data goes to SAP BW. Time and attendance data ends up as postings in the SAP HCM module. And that is exactly how SAP's tools are built. Even if this assumption was never entirely accurate, today a great many consuming systems are connected to each source system - and the trend is rising. For example, ERP data goes into SAP BW, a data lake, Ariba, Salesforce, and countless other intelligent enterprise apps.

Thus, even with data orchestration, as is common with all SAP tools, you won't get very far. It makes more sense if every consumer can help itself to the data at will - a data choreography. In such a setup, the conductor no longer dictates who has to do what and when; instead, there is a channel for each object in which source systems publish their changes and other systems consume those changes as they see fit.

For example, the ERP would publish the latest version every time a business partner entry is changed, and the BW consumes it once a day in one go. Another application, on the other hand, constantly listens for changes in this topic and can integrate them into its own application with a latency in the millisecond range.


Apache Kafka

If you put all these thoughts together, you inevitably end up with Apache Kafka. It is not only for this reason that Kafka is now used by practically every large company and is increasingly establishing itself as a standard. If it works for the Big Data world, we can certainly make good use of it for operational data, right?

At its core, Apache Kafka consists of "topics", which represent the data channels. Each topic can itself be partitioned to allow mass data to be processed in parallel. And each change message has a schema with the associated data. So, in our example, there is a schema "Business Partner" with master data such as first name and last name, and all of the customer's addresses nested inside it. Looked at from a data integration perspective, these are the SAP ERP table KNA1 with the associated ADRC address data. In process integration, the nested structure is used instead, for example via SAP IDocs or BAPIs.
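To make this more tangible, here is a minimal sketch of what such a nested business partner schema could look like, written as an Avro-style schema in Python. The field names are purely illustrative and not the official KNA1/ADRC column list.

```python
# Illustrative only: an Avro-style record with the addresses nested inside
# the business partner, instead of two flat tables (KNA1 + ADRC).
business_partner_schema = {
    "type": "record",
    "name": "BusinessPartner",
    "fields": [
        {"name": "PARTNER", "type": "string"},  # business partner key
        {"name": "FIRST_NAME", "type": ["null", "string"], "default": None},
        {"name": "LAST_NAME", "type": ["null", "string"], "default": None},
        {"name": "ADDRESSES", "type": {  # all addresses travel with the partner
            "type": "array",
            "items": {
                "type": "record",
                "name": "Address",
                "fields": [
                    {"name": "STREET", "type": ["null", "string"], "default": None},
                    {"name": "CITY", "type": ["null", "string"], "default": None},
                    {"name": "POSTAL_CODE", "type": ["null", "string"], "default": None},
                ],
            },
        }},
    ],
}
```

A consumer that only cares about the flat master data simply ignores the nested array; a consumer that needs the addresses gets them in the same message, without a join.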

This means extra work for the one (!) producer, but it makes it much easier for the many consumers. In a world where there are many consumers for each area, this is the more cost-effective way overall.

But it is not enough to simply hand every IDoc over to Kafka and not care what happens afterwards. If anything, the full potential should be exploited. One such opportunity revolves around changes to the structure - the death of any current integration solution. It is neither viable to adapt all producers and consumers synchronously, nor does it make sense to keep multiple versions of the structure alive at the same time. That is why I follow the concept of schema evolution: the ability to extend a schema without breaking anything.

The simplest case is easily explained: assume there are two producers and ten consumers for business partner master data. One producer, the SAP system, has received an additional Z field today. The SAP producer adds this field to the official schema and gives it a default value. From now on, the SAP system can also send this field.

The other producer continues to use the previous schema version for the next 20 minutes, until it resynchronizes. Switching to the new schema doesn't throw it off, though: this field doesn't exist for it, so it doesn't fill it, and the field simply remains at its default value. Nothing needs to be changed on this producer; it just keeps running as is.

When the consumers receive the new schema variant for the first time, it is used to read all messages from then on. Thus, the additional field is always present. If an old message is read via the new schema, the Z field is not in the data and is therefore filled with its default value. So in this case, too, there are no complications.
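This behavior can be tried out in a few lines. The sketch below uses the fastavro library and made-up field names (the "Z field" is called ZZ_LOYALTY_TIER here): a message written with the old schema is read through the new reader schema, and the missing field simply comes back as its default.

```python
import io
from fastavro import parse_schema, schemaless_reader, schemaless_writer

# Old schema: what the second producer still uses.
old_schema = parse_schema({
    "type": "record", "name": "BusinessPartner",
    "fields": [
        {"name": "PARTNER", "type": "string"},
        {"name": "NAME1", "type": ["null", "string"], "default": None},
    ],
})

# New schema: the SAP producer has added a Z field with a default value.
new_schema = parse_schema({
    "type": "record", "name": "BusinessPartner",
    "fields": [
        {"name": "PARTNER", "type": "string"},
        {"name": "NAME1", "type": ["null", "string"], "default": None},
        {"name": "ZZ_LOYALTY_TIER", "type": ["null", "string"], "default": None},
    ],
})

# A message produced before the new field existed.
buf = io.BytesIO()
schemaless_writer(buf, old_schema, {"PARTNER": "1000042", "NAME1": "ACME Corp"})
buf.seek(0)

# Reading the old message with the new reader schema: the missing field
# is filled with its default instead of breaking the consumer.
record = schemaless_reader(buf, old_schema, new_schema)
print(record)
# {'PARTNER': '1000042', 'NAME1': 'ACME Corp', 'ZZ_LOYALTY_TIER': None}
```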

The consumers, in turn, can decide for themselves how to handle the new field. An application consumer only gets the fields from the schema that it really needs anyway, and the Z field has no equivalent in the target application at the moment. A data lake consumer probably extends the target structure with this additional field automatically, so as never to lose information.

Schema evolution thus allows the official schema to be successively adapted over time. Then there are cases where the producer wants to send along purely technical information as well. For this purpose, an extension area is reserved in each schema.

In general, the schema contains some more information that can become interesting later on: What is the source system of the message? Which transformations has the data been subjected to? How is the data quality of the record to be assessed?
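As an illustration only - the field names below are my own for this sketch, not an official rtdi.io or SAP layout - such audit metadata plus a free-form extension area could be modeled like this:

```python
# Hypothetical audit/extension fields that could be appended to every
# record schema; names and types are assumptions for illustration.
audit_and_extension_fields = [
    {"name": "__source_system", "type": ["null", "string"], "default": None},
    {"name": "__transformations",
     "type": {"type": "array", "items": "string"}, "default": []},
    {"name": "__quality_rating", "type": ["null", "int"], "default": None},
    # Free-form key/value pairs for anything the producer wants to add later.
    {"name": "__extension",
     "type": {"type": "map", "values": "string"}, "default": {}},
]
```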

Message Queue vs. Kafka Transaction Log

The case where a consumer saw the additional field but could not use it highlights another, so far unsolved problem: How do you get data that has already been loaded a second time? Before Kafka, message queues would have been used, and there the only way to get all the data again is to have the source produce it again. But then it flows through all consumers, even those that have no interest in it at all. And if the next consumer is adjusted, all the data has to be produced yet again. What a horror. That is why message queues never caught on as originally expected.

However, the premise of our solution was that the consumer decides what to read and when. So it should also have the option of re-reading data it has already read. In practice, you would change this consumer as desired and tell it on restart to please re-read the data of the past seven days. Unlike message queues, Kafka does not throw the data away immediately; as a Big Data tool it is built to hold the change messages for a while or even forever.
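With the confluent-kafka Python client, for example, such a "re-read the last seven days" restart could look roughly like the sketch below. Broker address, topic and group names are made up for the example.

```python
import time
from confluent_kafka import Consumer, TopicPartition

consumer = Consumer({
    "bootstrap.servers": "localhost:9092",      # assumption: local broker
    "group.id": "bw-businesspartner-loader",    # made-up consumer group
    "enable.auto.commit": False,
})

topic = "BusinessPartner"
seven_days_ago_ms = int((time.time() - 7 * 24 * 3600) * 1000)

# Ask the broker which offset corresponds to that timestamp in each partition.
metadata = consumer.list_topics(topic, timeout=10)
wanted = [TopicPartition(topic, p, seven_days_ago_ms)
          for p in metadata.topics[topic].partitions]
offsets = consumer.offsets_for_times(wanted, timeout=10)

# Start reading from those offsets instead of the committed ones.
# (Partitions with no data in the window come back with offset -1, i.e. "end".)
consumer.assign(offsets)
while True:
    msg = consumer.poll(1.0)
    if msg is None:      # quiet for a second - good enough to stop the sketch
        break
    if msg.error():
        continue
    print(msg.topic(), msg.partition(), msg.offset(), len(msg.value() or b""))

consumer.close()
```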

This option is an immense benefit in numerous other situations as well. For example, the developer can repeat the same tests any number of times and always get the same change data. Or a new consumer doesn't start without data, but receives a large amount of history on its first call.
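The latter needs nothing more than a consumer group Kafka has never seen before, again sketched with confluent-kafka; names are invented for the example.

```python
from confluent_kafka import Consumer

# A brand-new consumer group with no committed offsets: "earliest" makes it
# start at the oldest retained message, so it begins with the full history.
consumer = Consumer({
    "bootstrap.servers": "localhost:9092",
    "group.id": "new-datalake-loader",
    "auto.offset.reset": "earliest",
})
consumer.subscribe(["BusinessPartner"])

while True:
    msg = consumer.poll(1.0)
    if msg is None or msg.error():
        continue
    # ... hand msg.value() over to the target application or data lake ...
```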

If you are also looking for an affordable, open and forward-looking solution for the integration of your various applications, you can find inspiration on my website.

Werner Dähn, rtdi.io

Werner Dähn is Data Integration Specialist and Managing Director of rtdi.io.

