The Common Crawl Foundation, a non-profit organization that builds a publicly accessible archive of the internet, is allegedly enabling AI companies to access and use paywalled content from major news publishers as training data for their models. The practice has sparked a debate about copyright, fair use, and the future of the journalism industry.
What is Common Crawl and Why is It Relevant?
Common Crawl’s primary mission is to build a massive, publicly available archive of the internet. It operates by “scraping” the web, automatically collecting data from publicly accessible websites. This archive, spanning multiple petabytes, is made available to researchers, academics, and, as recent reporting suggests, AI companies such as Google, Anthropic, OpenAI, and Meta. The foundation’s website claims its data is collected solely from freely available web pages, but that claim is now under scrutiny.
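Scraping at this scale conventionally begins with each site’s robots.txt file, which tells crawlers which paths they may fetch; CCBot is the user agent Common Crawl’s crawler identifies itself with. A minimal sketch of that check using Python’s standard library (the robots.txt contents and URLs below are hypothetical, for illustration only):

```python
from urllib import robotparser

# Hypothetical robots.txt a publisher might serve: block everyone from
# subscriber-only pages, and block Common Crawl's CCBot site-wide.
ROBOTS_TXT = """\
User-agent: *
Disallow: /subscriber-only/

User-agent: CCBot
Disallow: /
"""

def may_fetch(agent: str, url: str) -> bool:
    """Return True if `agent` is allowed to fetch `url` under ROBOTS_TXT."""
    rp = robotparser.RobotFileParser()
    rp.parse(ROBOTS_TXT.splitlines())
    return rp.can_fetch(agent, url)

print(may_fetch("SomeBot", "https://example.com/news/story"))         # True
print(may_fetch("SomeBot", "https://example.com/subscriber-only/a"))  # False
print(may_fetch("CCBot", "https://example.com/news/story"))           # False
```

Note that robots.txt is a convention, not an enforcement mechanism: a crawler that ignores it can still fetch whatever the server returns, which is part of why publishers have turned to takedown requests and litigation instead.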
The Allegations: A Backdoor for AI Data Acquisition
According to an investigative report in The Atlantic, several major AI companies have quietly partnered with Common Crawl, effectively creating a backdoor for acquiring paywalled content. Reporter Alex Reisner details how Common Crawl’s archive allows AI companies to train their models on material from news organizations such as The New York Times, Wired, and The Washington Post, publications that rely on subscriptions and paywalls for revenue. The foundation’s executive director, Richard Skrenta, believes that AI models should have access to everything on the internet, a stance that clashes with the copyright protections afforded to publishers.
Impact on the Journalism Industry: The “Traffic Apocalypse”
The rise of AI chatbots like ChatGPT and Google Gemini has already created a crisis for the journalism industry. These chatbots can scrape information from publishers and present it directly to users, diverting traffic and potential revenue away from news websites. This phenomenon, sometimes referred to as the “traffic apocalypse” or “AI armageddon,” poses a significant threat to the financial stability of news organizations. Mashable’s parent company, Ziff Davis, has even filed a copyright-infringement lawsuit against OpenAI, highlighting the growing legal challenges.
Publishers’ Efforts to Remove Content and Common Crawl’s Response
Several news publishers have become aware of Common Crawl’s activities and have requested that their content be removed from the archive. The removal process, however, has proven slow and complex. While Common Crawl claims to be complying with these requests, The Atlantic’s reporting suggests that many takedown requests haven’t been fulfilled. The organization has also admitted that its file format is designed to be “immutable,” meaning content is difficult to delete once added. Furthermore, Common Crawl’s public search tool returns misleading results for certain domains, masking the true scope of the scraped data.
Common Crawl’s Defense and Potential Conflicts of Interest
Common Crawl has strongly denied the accusations of misleading publishers. In a blog post, Richard Skrenta stated that the organization’s web crawler doesn’t bypass paywalls and that Common Crawl is “not doing AI’s dirty work.” However, the foundation has received donations from AI-focused companies like OpenAI and Anthropic and lists NVIDIA as a “collaborator,” raising questions about potential conflicts of interest. Beyond simply collecting raw text, Common Crawl also assists in assembling and distributing AI training datasets, sometimes even hosting them for wider use.
The Bigger Picture: Copyright and the Future of AI Training
The controversy surrounding Common Crawl highlights a larger debate about how the AI industry uses copyrighted material. Major publishers, including The New York Times and Ziff Davis, are already engaged in lawsuits against AI companies. The legal and ethical stakes are significant, and the battle over copyright and fair use is far from over, representing a crucial moment for both the AI industry and the future of digital publishing.
