AI—From Start to Finish


Although it has not been proven in all cases, the assumption seems obvious: despite virtual prohibition signs and paywalls, the operators of large language models (LLMs) are plundering the wealth of data on the internet. AI pioneers are among the world's best computer scientists, and for them, circumventing any hurdle or barrier is child’s play.
There is a WWW etiquette, the Robots Exclusion Protocol: in a robots.txt file at the root of a website, or in a meta tag at the beginning of a page's HTML code, a virtual entry ban for bots and crawlers can be declared. This barrier can be useful for various reasons. For example, if a website is under construction and still contains test data, it makes little sense for a Google crawler to index these pages. A web crawler, also called a spider or bot, is an automated program that combs the internet to collect and index content from websites. Crawlers follow hyperlinks to discover new web pages and store information such as titles, images, and keywords to create searchable indexes for search engines like Google and Bing.
This prohibition sign for web crawlers can, of course, also be used to protect your own content. The prerequisite is that crawlers actually comply with WWW etiquette: the sign is a request, not a technical lock, and any such protection can be circumvented with more sophisticated programming. Numerous experiments show that the web crawlers of major IT companies regularly bypass these virtual prohibition signs to gather training data for their LLMs.
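To make the mechanism concrete, here is a minimal sketch in Python of how a well-behaved crawler is supposed to consult this prohibition sign, the robots.txt file of the Robots Exclusion Protocol, before fetching a page. The web address and the user-agent names in the sketch are illustrative assumptions, not references to any specific operator.

```python
# Minimal sketch, assuming a hypothetical site at example.com: how a crawler
# that respects the Robots Exclusion Protocol checks robots.txt before
# fetching a page. The user-agent names are illustrative examples.
from urllib import robotparser

# A robots.txt that bans AI training crawlers could look like this:
#
#   User-agent: GPTBot
#   Disallow: /
#
#   User-agent: CCBot
#   Disallow: /

parser = robotparser.RobotFileParser()
parser.set_url("https://example.com/robots.txt")
parser.read()  # download and parse the prohibition sign

page = "https://example.com/magazine/article-123"
if parser.can_fetch("GPTBot", page):
    print("robots.txt allows crawling this page")
else:
    print("robots.txt forbids crawling this page")

# Nothing in the protocol enforces the answer: a crawler that simply skips
# this check can still download the page. Compliance is etiquette, not a lock.
```

The decisive point is in the last comment of the sketch: the check is voluntary. A crawler that skips it meets no technical obstacle at all.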
Authors, journalists, artists, photographers, and all other content producers consider this circumvention of a technical barrier to be copyright infringement and theft of intellectual property. In the US, there are already preliminary legal opinions and initial court rulings on the issue. In short, some US judges hold that these signs may be circumvented for AI training purposes. However, this does not mean that the texts and photos may then be used in AI responses and results. That may be legally tenable, but it contradicts human sensibilities.
For training purposes, then, the AI may read E3 Magazine, but it may not quote it. For someone in the SAP community seeking assistance, a summary of E3 content is probably sufficient, and with the training data the AI can produce such a summary well. There is no need for verbatim quotes; the cat is out of the bag anyway, right?
Ultimately, it is a financial issue. Anyone who used E3 content commercially had a business relationship with the publisher, which ensured the give-and-take that is so important in the SAP community and allowed new sources to emerge. If AI now plunders E3 sources without giving anything in return, however, there is a risk that E3 and many other independent SAP sources will dry up.
In a few years, only official SAP websites and the SAP User Group's web presence may be available for training large language models. The results will then be more modest.
