The arrival of AI chatbots marks a historical dividing line: material published online after it can no longer be fully trusted to be human-created. But how will people look back on this change? While some are urgently working to archive “uncontaminated” data from the pre-AI era, others say it is the AI outputs themselves that we need to record, so future historians can study how chatbots have evolved.
Rajiv Pant, an entrepreneur and former chief technology officer at both The New York Times and The Wall Street Journal, says he sees AI as a risk to information such as news stories that form part of the historical record. “I’ve been thinking about this ‘digital archaeology’ problem since ChatGPT launched, and it’s becoming more urgent every month,” says Pant. “Right now, there’s no reliable way to distinguish human-authored content from AI-generated material at scale. This isn’t just an academic problem, it’s affecting everything from journalism to legal discovery to scientific research.”
John Graham-Cumming, former chief technology officer of Cloudflare, has created a website called lowbackgroundsteel.ai to archive sources of data that haven’t been contaminated by AI, such as a full download of Wikipedia from August 2022. The name is a nod to low-background steel, metal smelted before the first nuclear tests and prized because it is free of radioactive contamination. Studies have already found signs of substantial AI-generated content in today’s Wikipedia.