You can divide the recent history of LLM data scraping into a few phases. For years there was an experimental period, when ethical and legal questions about where and how to acquire training data for hungry experimental models were treated as afterthoughts.
Once apps like ChatGPT became popular and companies started commercializing models, the matter of training data became instantly and extremely contentious.
Authors, filmmakers, musicians, and major publishers and internet companies started calling out AI firms and filing lawsuits.
OpenAI started making individual deals with publishers and platforms, including Reddit and Vox Media, the parent company of New York, to ensure ongoing access to data for training and up-to-date chat content. Other companies, including Google and Amazon, entered into licensing deals of their own.
Despite these deals and legal battles, however, AI scraping became only more widespread and brazen, leaving the rest of the web to wonder what, exactly, is supposed to happen next.
Read more | NY MAG