AI – from source to sink


It has not been proven in all cases, but the assumption seems obvious: despite virtual prohibition signs and paywalls, the operators of large language models (LLMs) are "plundering" the wealth of data on the internet. The AI pioneers are among the best computer scientists in the world, so it should be easy for them to circumvent any hurdle or barrier.
There is a WWW etiquette: with a robots.txt file at the root of a website or a robots meta tag at the beginning of its HTML code, a virtual entry ban for bots and crawlers can be declared. This barrier can be useful for various reasons: if, for example, a website is under construction and still contains test data, it makes little sense for a Google crawler to index these pages. A web crawler is an automated program (also called a spider or bot) that searches the internet to collect and index content from websites. The crawler follows hyperlinks to discover new web pages and stores information such as titles, images, and keywords to create a searchable index for search engines such as Google or Bing.
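The following minimal sketch, written in Python with only the standard library's urllib.robotparser, illustrates how a well-behaved crawler checks this prohibition sign before fetching a page; the bot name and site address are illustrative placeholders, not real crawlers or publisher URLs.

# Sketch of a "polite" crawler that respects robots.txt.
# BOT_NAME and SITE are hypothetical placeholders.
from urllib.robotparser import RobotFileParser
from urllib.request import urlopen

BOT_NAME = "ExampleBot"
SITE = "https://www.example.com"

# Read the site's robots.txt, the machine-readable prohibition sign.
robots = RobotFileParser()
robots.set_url(SITE + "/robots.txt")
robots.read()

def fetch_if_allowed(url):
    """Fetch a page only if robots.txt permits it for this bot."""
    if robots.can_fetch(BOT_NAME, url):
        with urlopen(url) as response:
            return response.read()
    # A compliant crawler simply skips disallowed pages.
    return None

page = fetch_if_allowed(SITE + "/index.html")

The point of the sketch is that compliance is entirely voluntary: nothing technically prevents a crawler from skipping the can_fetch check and downloading the page anyway.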
Naturally, this prohibition sign for web crawlers can also be used to protect your own content. The prerequisite, of course, is that crawlers actually comply with WWW etiquette: the ban is a request, not a technical lock, and any such protection can be circumvented with more sophisticated programming. Numerous experiments have shown that the web crawlers of the major IT pioneers regularly bypass these virtual prohibition signs to train their LLMs.
Authors, journalists, artists, photographers, and all content producers consider this circumvention of a technical barrier to be a copyright infringement and theft of intellectual property. There are preliminary legal opinions and court rulings on this issue in the US. In short, some US judges believe that the prohibition signs can be circumvented for the purpose of AI training. However, this does not mean that these texts and photos may be used in AI responses and results. It is a fine line that may be legally tenable, but it contradicts human sensibilities.
So, for training purposes, the AI is allowed to read E3 magazines, but it is not allowed to quote them. A good summary of E3 content is probably enough to help someone in the SAP community seeking assistance, and with this "training data" the AI can certainly deliver one. There is no need for the luxury of a verbatim quote; the cat is out of the bag anyway, right?
Ultimately, it is a financial problem: whoever used E3 content commercially had a business relationship with the publisher. This ensured the all-important give and take in the SAP community and allowed new sources to emerge. If AI now "plunders" E3 sources without providing anything in return, there is a risk that E3 and many other independent SAP sources will dry up.
In a few years, only the official SAP websites and the user group's WWW offering may be available to AI for training large language models. The answers coming out of Silicon Valley will then be more modest. (pmf)






