The non-profit doing the AI industry’s dirty work

THE ATLANTIC

Nov 05, 2025

The Common Crawl Foundation is little known outside of Silicon Valley. For more than a decade, the nonprofit has been scraping billions of webpages to build a massive archive of the internet.

This database—large enough to be measured in petabytes—is made freely available for research. In recent years, however, this archive has been put to a controversial purpose: AI companies including OpenAI, Google, Anthropic, Nvidia, Meta, and Amazon have used it to train large language models.

In the process, my reporting has found, Common Crawl has opened a back door for AI companies to train their models with paywalled articles from major news websites. And the foundation appears to be lying to publishers about this—as well as masking the actual contents of its archives.
