Citizen Data Scientists
Business analysts work within the structured data models of a data warehouse. They usually know these models well and understand how to use front-end tools (Excel, Tableau, SAP BO) to build queries against them in order to cover their information requirements.
Modern tools conceal the complexity of the underlying database structures: they automatically generate the programming code required for the queries and thus give analysts a certain degree of independence from IT professionals.
Business analysts have often studied business administration, economics or business informatics and work in the business departments or at the interface between business and IT.
Data lakes instead of warehouses
"Data is the new oil." This frequently used slogan describes the importance of data for the advancing digitalization in all areas of life.
Data is collected everywhere, from the use of smartphones and the sensors in our vehicles to the coffee machine app that automatically reorders the capsules.
Instead of flowing into the organized structures of a data warehouse, the data now flows into a so-called data lake: a repository that stores large volumes of data in their original, raw format until they are needed.
Because there is no predefined data schema, extensive metadata is recorded alongside the data so that relevant data can still be found once a concrete request is defined.
If a business question arises, for example, the data lake can be searched for relevant data, and the resulting data set can then be analyzed in a targeted manner to help solve the business problem.
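A minimal sketch in Python of how such a metadata catalog might look; all paths, fields and tags are hypothetical and only illustrate the principle:

```python
# Minimal sketch of a metadata catalog for a data lake (all names hypothetical).
# Raw files keep their original format; only descriptive metadata is recorded
# so that relevant data can be found later.
catalog = [
    {"path": "lake/raw/sensors/2023-05.csv", "format": "csv",
     "source": "vehicle sensors", "tags": ["telemetry", "temperature"]},
    {"path": "lake/raw/social/posts.json", "format": "json",
     "source": "social media", "tags": ["text", "customer feedback"]},
]

def find_datasets(catalog, tag):
    """Return all raw files whose metadata mentions the given tag."""
    return [entry["path"] for entry in catalog if tag in entry["tags"]]

print(find_datasets(catalog, "telemetry"))  # -> ['lake/raw/sensors/2023-05.csv']
```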
In specialist literature, "information" is often defined as knowledge that is relevant to decision-making or action. Data therefore becomes "information" when it helps to solve a problem or bring about a decision.
Interdisciplinary knowledge and AI
This is precisely the purpose of the data warehouse. The data is structured and processed in such a way that the user can meet their information requirements independently.
In a data lake, these information structures are initially missing and must first be discovered and prepared by experts. This requires special IT skills: the structures and correlations are typically uncovered with mathematical and statistical methods, which in turn have to be implemented in programming languages such as R or Python.
Machine learning, drawing on methods from the discipline of artificial intelligence, provides assistance here. Obviously, this calls for mathematicians, computer scientists, natural scientists or engineers (STEM graduates) with a solid theoretical background.
STEM graduates are not only very difficult to come by; they usually also have little knowledge of business administration, which makes discovering new, business-relevant correlations in the data lake a major problem.
It therefore makes sense to give well-trained, experienced business analysts further training in selected data science methods and to procure specialized tools that support these methods with easy-to-use guidance.
The market research company Gartner coined the term citizen data scientist [1] in an article back in 2015. Gartner describes a convergence of business analytics and predictive analytics that can help organizations close the gap between complex mathematical analysis functions and artificial intelligence methods.
This also enables companies to make significant progress along the business analytics maturity curve: convergence helps predictive analytics reach a broader audience of business analysts, who thereby become citizen data scientists.
A Citizen Data Scientist (CDS) is more than just an experienced Excel user who knows how to analyze pivot tables. A CDS is able to map a business question methodically onto the data science process, understands the critical importance of data quality for machine learning, and can evaluate and use different tools.
Nor should they be afraid of a programming language. This is less about programming complex applications and more about scripting small components and applying and parameterizing existing algorithms, as the sketch below illustrates.
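A minimal sketch of that working style in Python, assuming scikit-learn is available; the data is synthetic and the parameter values are illustrative:

```python
# Sketch: a CDS rarely implements algorithms from scratch; instead an existing
# one (here scikit-learn's decision tree) is applied and parameterized.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=8, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Parameterization instead of programming: a few lines adapt the
# ready-made algorithm to the problem at hand.
model = DecisionTreeClassifier(max_depth=4, min_samples_leaf=10)
model.fit(X_train, y_train)
print(f"Accuracy: {model.score(X_test, y_test):.2f}")
```

The value a CDS adds lies in choosing and tuning parameters such as max_depth and min_samples_leaf, not in implementing the tree algorithm itself.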
New type of data indexing
For the data scientist, the process of developing new knowledge changes completely. In traditional data warehousing, a multidimensional model is first created in collaboration between the business department and IT, and a schema for a data mart is derived from it.
The schema essentially consists of key figures and the attributes related to them. Dimensions and hierarchies are further structural features that help organize the requirements of business users.
The structures are then filled using an extraction, transformation and loading (ETL) process. Whether, for example, an SAP HANA schema or an SAP BW InfoProvider is filled is really only a technical question; what matters is that a schema agreed with the business department is filled with data.
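A toy illustration of such a schema in Python with pandas (the table contents are invented): the fact table carries the key figure, a dimension table carries the attributes, and an analysis joins and aggregates along the dimension:

```python
# Sketch of a miniature star schema in pandas (illustrative data only):
# a fact table holds the key figures, dimension tables hold the attributes.
import pandas as pd

dim_product = pd.DataFrame({"product_id": [1, 2],
                            "category": ["Coffee", "Tea"]})
fact_sales = pd.DataFrame({"product_id": [1, 1, 2],
                           "revenue": [120.0, 80.0, 45.0]})  # key figure

# The "loading" step of ETL fills this schema; analysis then joins
# facts with dimensions and aggregates along them.
report = (fact_sales.merge(dim_product, on="product_id")
                    .groupby("category")["revenue"].sum())
print(report)
```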
A data scientist, on the other hand, takes a completely different approach: the data from their sources often has no predefined or obvious structure at first.
For example, CSV files with sensor data, texts from social media or geodata from a smartphone app are stored in the file system of a data lake. If a business user now approaches the data scientist with an information requirement, a data exploration process is triggered, at the end of which stands a data structure suitable for analysis tools.
Whether the target is a data mining or a predictive maintenance application matters at this point, but it is not decisive.
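The first exploration steps often look something like the following pandas sketch; the file path and the contents behind it are hypothetical:

```python
# Sketch of the first exploration steps on a raw CSV file from the lake
# (the file path is a hypothetical example).
import pandas as pd

df = pd.read_csv("lake/raw/sensors/2023-05.csv")

print(df.shape)          # how much data is there?
print(df.dtypes)         # which column types were inferred?
print(df.isna().mean())  # share of missing values per column
print(df.describe())     # value ranges, first hints at outliers
```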
During this data development process, a data set is first created that is "presumably" suitable for the analysis. This is where the problems begin: the data set should be representative, i.e. it should contain characteristics and records that reflect the application scenario as well as possible.
"All data" is usually not suitable for analysis applications, as too many outliers and peculiarities would distort the results. The data is therefore transformed so that it "fits" the needs of the analysis tools.
The quality of the data plays a decisive role here. A data scientist's analysis tool "learns" from the data it is given, and the quality of that data is fundamentally irrelevant to the tool. It may "learn", for example, that an above-average number of newsletter subscribers come from "Afghanistan".
The reason is simply that "Afghanistan" is the first entry in the country list on the website's sign-up screen. Unfortunately, such data constellations are often not as obvious as in this example.
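A simple plausibility check can flag such default-value artifacts before a model learns from them. This sketch with invented sign-up data marks any country that accounts for more than half of all records; the 50 percent threshold is an arbitrary assumption:

```python
# Sketch of a plausibility check for the default-value problem described
# above: a value that is vastly over-represented (such as the first entry
# of a dropdown list) is flagged before any model learns from it.
import pandas as pd

signups = pd.DataFrame({"country": ["Afghanistan"] * 60 + ["Germany"] * 25
                                   + ["France"] * 15})

shares = signups["country"].value_counts(normalize=True)
suspicious = shares[shares > 0.5]  # threshold is an illustrative assumption
print(suspicious)  # Afghanistan 0.6 -> worth a manual check
```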
Data quality is crucial
A research project carried out jointly by Stuttgart Media University and Uniserv showed how easily poor data quality can invalidate the measured quality of an analysis tool.
To this end, scenarios were created in which learning was carried out with both high-quality and poor-quality data. The approach that produces the high-quality data is referred to as "ground truth".
The term was originally coined at MIT and was developed further within the research project. Customer master data was enriched with transaction data that had previously passed through a data quality control system.
The result is a data set that combines master data and transaction data and thus forms a precise profile of each customer. At the end of the data development process stands a data structure that is handed over to the analysis tool.
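A much-simplified sketch of this enrichment step in pandas; the tables, the single quality rule and the aggregation are invented stand-ins for the project's actual data quality control system:

```python
# Sketch (all names hypothetical): transaction data first passes a simple
# quality gate and is then joined with the customer master data to form
# one profile per customer.
import pandas as pd

master = pd.DataFrame({"customer_id": [1, 2],
                       "country": ["DE", "FR"]})
transactions = pd.DataFrame({"customer_id": [1, 1, 2, 2],
                             "amount": [30.0, -5.0, 12.0, 50.0]})

# Quality gate: here simply "no negative amounts"; a real data quality
# control system applies far more rules.
checked = transactions[transactions["amount"] > 0]

profile = (checked.groupby("customer_id")["amount"]
                  .agg(["count", "sum"])
                  .reset_index()
                  .merge(master, on="customer_id"))
print(profile)  # one enriched profile row per customer
```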
The results generated are often unsatisfactory at first; additional attributes or other data must then be added to the analysis data pool.
The data development process for creating the ground truth then starts over from the beginning. Software components that support this process therefore play a crucial role for the data scientist: without adequate data quality, reliable predictive analytics is not possible.
Due to the urgent need to understand data as a corporate asset and to harness its potential, more and more public universities and companies are offering Citizen Data Scientist training courses.
It is important to ensure a balanced mix of theory and practical application scenarios with hands-on experience. Exchanging ideas with business analysts from other companies should be a matter of course, as should working with IT systems from different manufacturers.
The Stuttgart Media University offers an application-oriented, professional training program to become a Citizen Data Scientist.
[1] Gartner: Predicts 2015: A Step Change in the Industrialization of Advanced Analytics, https://www.gartner.com/doc/2930917/predicts-step-change-industrialization, accessed February 26, 2018.