
Memory Strategies & Management

If the data of a Hana application can no longer be processed by a single server node (scale-up), the data must be distributed across multiple nodes (scale-out). Since Numa concerns access to local main memory, this article does not go into aspects of distributed systems in detail.
Norman May, SAP
January 30, 2014

Nevertheless, the techniques for improving performance in Numa systems and in distributed landscapes are similar.

Since SAP applications often manage several hundred gigabytes or even terabytes of data, efficient access to main memory is essential. This article focuses on this aspect.

At Sapphire 2012 in Orlando, SAP presented a system that enables analytical queries on a distributed Hana-DB with 100 nodes and a total of 100 TB of aggregated main memory.

Typical distributed Hana server landscapes, however, tend to use five to ten server nodes. It is also clear that in a distributed server landscape the focus is more on the effective distribution of data and requests and on efficient network communication than on efficient access to main memory.

What is Numa?

Modern server systems have several processors (or CPUs) on their circuit board: typically one to four CPUs in desktop systems from Intel, whereas Intel servers have two to eight CPUs. Processors are connected to the board via a socket.

Each of these CPUs normally contains several cores in which the calculations are carried out. Modern Intel CPUs contain two to ten cores. This means that large servers have a total of up to 80 cores available for data processing.

As the clock frequency in the cores cannot be increased any further for thermal reasons, among others, the number of cores in servers will continue to increase in the coming years. The CPUs and the main memory are connected to each other via a bus system.

If the calculations for a request can be distributed across many cores, the question arises as to how the data is transported to the cores. In principle, there are two alternatives here at the processor architecture level (Hennessy & Patterson, 2012):

  1. Symmetric Multiprocessing (SMP):
    In this architecture, the access time to a memory address is the same for all addresses and for all cores. This architecture is shown in Figure 2. Caches are assigned locally to each processor. The main memory is accessed via a bus that is shared by all processors. In this architecture, the memory bus can become a bottleneck, because read operations that cannot be served by the local cache, as well as all write operations, have to go through the shared memory bus.
  2. Non-Uniform Memory Access (Numa):
    In this architecture (Figure 3), both caches and memory are assigned locally to each processor. For a processor, accessing the local memory is faster than accessing the memory of another processor, because remote accesses must be processed via a memory bus between the processors. For application programs, the assignment of physical memory to individual processors is not directly visible: they work as if on an SMP system with a homogeneous address space.

Since modern Intel systems contain several cores in one processor, the result is a Numa architecture at the level of the processors, but an SMP system within each processor.

The latter is also known as a chip multi-processor (CMP). Examples of SMP systems are the Intel Pentium D, Intel Itanium, IBM Power, Sun UltraSparc T2 or SGI MIPS, while examples of Numa architectures are Intel Nehalem CPUs or AMD Opteron CPUs (and their successors).

Is Numa relevant for Hana?

The SAP Hana-DB was developed in cooperation with Intel for execution on current Intel Xeon processors. For example, the Hana-DB uses the SSE extensions of Intel processors to process several elements in parallel with a single machine instruction.
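To make this concrete, here is a minimal sketch of such SIMD processing with SSE intrinsics; the function and data are illustrative, not taken from the Hana code base. It adds two float arrays four elements at a time, so one machine instruction performs four additions:

    #include <emmintrin.h>  // SSE2 intrinsics (includes SSE)
    #include <cstddef>

    // Adds two float arrays; one SSE instruction handles four elements.
    void add_arrays(const float* a, const float* b, float* out, std::size_t n) {
        std::size_t i = 0;
        for (; i + 4 <= n; i += 4) {
            __m128 va = _mm_loadu_ps(a + i);             // load 4 floats
            __m128 vb = _mm_loadu_ps(b + i);
            _mm_storeu_ps(out + i, _mm_add_ps(va, vb));  // 4 additions at once
        }
        for (; i < n; ++i)  // scalar tail for the remaining elements
            out[i] = a[i] + b[i];
    }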

As these Intel processors are based on a Numa architecture, the code of the Hana-DB can be optimized for this architecture. In the following, we discuss some scenarios in which Numa effects are relevant for the Hana-DB and how the Hana-DB can handle them.

When a request reaches the Hana-DB, it is first assigned to a thread. In general, threads allow lightweight concurrent processing of multiple requests (compared to operating-system processes).

An active thread is executed on exactly one core at any point in time. During the processing of a query, the database must in most cases allocate memory, for example to collect the result of the query for the database application. This memory should be allocated in the memory area assigned to the processor of that core, so that memory accesses are not delayed by accesses to remote memory.

Modern operating systems already take Numa architectures into account: both Microsoft Windows 7 or Windows Server 2008 R2 and Linux (from kernel 2.5) attempt to allocate memory in the area assigned to the processor of the thread or operating-system process. This means that applications automatically benefit from optimizations in the operating system.
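A small sketch of this behavior on Linux, under the usual first-touch assumption (a physical page is placed on the Numa node of the thread that first writes it); all names are illustrative:

    #include <cstddef>
    #include <cstring>
    #include <thread>
    #include <vector>

    int main() {
        const std::size_t bytes = 64u * 1024 * 1024;
        std::vector<std::thread> workers;
        for (int t = 0; t < 4; ++t) {
            workers.emplace_back([bytes] {
                // new[] only reserves virtual memory; the memset is the
                // first touch, so the pages land on the Numa node of the
                // core this worker is currently running on.
                char* buf = new char[bytes];
                std::memset(buf, 0, bytes);
                // ... process buf with local memory accesses ...
                delete[] buf;
            });
        }
        for (auto& w : workers) w.join();
    }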

It should be noted here that virtualization solutions such as VMware ESX abstract from the physical hardware. As the software works with logical CPUs and a virtualization layer for the memory, optimizations for a Numa architecture can even have negative effects on a virtualized system.

The automatic memory management of the operating system can lead to undesirable effects if an application manages memory itself in order to avoid expensive system calls when allocating and releasing memory: when such memory is reused, it may reside in the wrong area. Virtually every database implements its own memory management at application level.

Another effect is that one thread allocates memory (locally), but many other threads want to work with this memory.

An example of this is the memory of a column: This memory is allocated once when the column is loaded into the main memory, but many requests read the column data.

In both of these scenarios (memory management at application level and memory access by many threads), effective scheduling of the threads can help. Here too, modern operating systems implement strategies to execute threads where the data they use is allocated.

This can mean that a thread is moved from one core to another so that its memory accesses can be served from local memory. However, the operating system reaches its limits where, for example, knowledge available inside the database system would lead to better decisions.
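Where the application knows better, it can pin a thread to a core itself. A hedged, Linux-specific sketch (the core number is an arbitrary example, not a Hana interface):

    #include <pthread.h>
    #include <sched.h>
    #include <thread>

    // Restrict a thread to a single core so the OS cannot migrate it
    // away from the memory it has allocated locally (Linux/glibc).
    void pin_to_core(std::thread& t, int core) {
        cpu_set_t set;
        CPU_ZERO(&set);
        CPU_SET(core, &set);
        pthread_setaffinity_np(t.native_handle(), sizeof(set), &set);
    }

    int main() {
        std::thread worker([] { /* process a query on local data */ });
        pin_to_core(worker, 2);  // core 2 is an arbitrary example
        worker.join();
    }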

Numa support in Hana

The previous section discussed strategies for Numa architectures that are available to any application on modern servers and modern operating systems.

However, these techniques alone lead to suboptimal decisions because they cannot take the special properties of a database system into account. Database-specific optimization options in the Hana-DB are covered in this section.

Memory management

As indicated above, the Hana-DB uses, for reasons of efficiency, its own memory management system built on top of the memory management of the operating system. Memory released in the database code is normally not returned to the operating system.

At the same time, memory that has already been allocated is reused. This is intended to reduce the fragmentation of the main memory as well as the number of system calls to the operating system. This opens up opportunities to exploit the specific properties of a Numa architecture:

1. If a thread requests memory, local memory of the processor is made available to the thread, so that accesses to this memory are processed by the local memory controller and the memory buses between the processors are relieved. This strategy seems particularly useful in scenarios where the memory is used by threads on the same processor.

2. In some cases, however, it seems to make more sense to distribute the requested memory across several processors. In current systems, the memory controller can become a bottleneck in certain scenarios. In these cases, it makes more sense to distribute both the memory and the threads that use it across different processors. In this way, the bottleneck can be avoided, for example when accessing large and frequently used columns, as the sketch below illustrates.
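Both placement strategies can be expressed with the Linux libnuma API (link with -lnuma). This sketch works below the level of the Hana memory manager and is purely illustrative, but it shows the same trade-off:

    #include <numa.h>
    #include <cstddef>

    // Strategy 1: node-local pages, served by the local memory controller.
    // Strategy 2: pages interleaved round-robin across all nodes, so that
    // no single memory controller becomes the bottleneck.
    void* allocate(std::size_t bytes, bool spread_across_nodes) {
        if (numa_available() < 0) return nullptr;  // no Numa support
        return spread_across_nodes ? numa_alloc_interleaved(bytes)
                                   : numa_alloc_local(bytes);
    }
    // Release such memory with numa_free(ptr, bytes).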

Job Scheduling

In many cases, a modern operating system makes a good decision about which threads are executed on which core.

This decision takes into account whether there are cores on a processor that are not currently performing any calculations, the activity status of individual cores and processors (to save energy, it is worth bundling work on individual processors and deactivating others), whether individual processors are currently running "overclocked" (Turbo Boost on Intel), and which data a thread accesses.

In addition to this automatic decision by the operating system, an application developer can influence the assignment of threads to processors or cores.

Basically, the associated optimization options for scheduling the threads in the system are dependent on memory management:

  1. If data used by a thread is assigned locally to a processor, then the thread should also be executed on this processor. Somewhat surprisingly, it is sometimes worthwhile to run more threads on a processor than the processor can process simultaneously (the number of cores, or with hyperthreading theoretically twice the number of cores), especially if these threads access shared memory and this memory is already available in the caches.
  2. Some complex database operations read and write large amounts of data and thus place a load on the memory controllers. If the data for these operations is already distributed across the local memory of several processors, then the threads accessing the data should also be distributed across several processors. This relieves individual memory controllers and spreads the load across multiple memory links and memory controllers, as the sketch after this list illustrates.
  3. In some cases, database operations such as join operations should be implemented with a special focus on the Numa architecture. The first "guidelines" for such implementations are discussed in the research literature (Albutiu, Kemper & Neumann, 2012).
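As a hedged sketch of point 2: each worker below binds itself to the Numa node that holds its data partition before scanning it (libnuma on Linux; the partition layout and names are purely illustrative):

    #include <numa.h>
    #include <functional>
    #include <thread>
    #include <vector>

    struct Partition { int numa_node; /* ... column fragment ... */ };

    void scan(const Partition& p) {
        numa_run_on_node(p.numa_node);  // run on the CPUs of the data's node
        // ... scan the partition; accesses are now node-local ...
    }

    int main() {
        std::vector<Partition> parts = {{0}, {1}};  // one partition per node
        std::vector<std::thread> workers;
        for (const auto& p : parts) workers.emplace_back(scan, std::cref(p));
        for (auto& w : workers) w.join();
    }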

Future of Numa

For several decades, database software was built on the premise that processing speed would increase with each new processor generation, partly because the clock frequency rose. For technical reasons, this automatism has no longer held since the beginning of the millennium.

Suppliers such as Intel or AMD are promoting systems in which the work is distributed across several processors, each with several cores. As discussed in this article, an architecture in which the cost of a memory access differs depending on where the memory was physically allocated (referred to as Numa) appears to be gaining acceptance.

Application software must therefore be redesigned to take advantage of the parallelism that has come with the availability of multi-core architectures. In connection with this, the software must take into account the peculiarities of the Numa architecture and optimize the allocation of memory and access to it.

While modern operating systems provide some optimizations in this area, performance-critical applications such as the Hana-DB can realize improvements that go far beyond this.

A number of these improvements have already been integrated into the Hana-DB. Nevertheless, the implementation of databases on Numa architectures is still in its infancy, and further improvements are to be expected.
