Reddit blocks the Internet Archive from crawling its data – here’s why

ZDNET’s key takeaways

The Internet Archive can now only crawl Reddit’s homepage.
Reddit’s goal is to block AI companies from scraping Reddit user data.
Publishers (and others) are suing AI companies for copyright infringement.

Reddit is defending its privacy from AI companies that are taking roundabout approaches to scraping its content.

The social media platform, known as a resource where users can post anonymously and find information about almost any subject, will block the Internet Archive’s Wayback Machine from indexing its online data, according to a Monday report from The Verge. The move is in response to the discovery that AI companies, unable to scrape data from Reddit directly because of the platform’s restrictive policies, have instead been retrieving its data from indexed content on the Internet Archive and using it to train models.

The Wayback Machine will now only be able to scrape data from Reddit’s homepage, according to The Verge, while access to user profiles, comments, and post detail pages will be blocked.

Launched in 1996, the Internet Archive is a non-profit that operates an enormous digital database of web content. The archive is maintained in part by the Wayback Machine, a piece of web-crawling software that gathers web pages and preserves them as they appeared when they were collected, like digital flies in amber. This serves as a resource for researchers studying the evolution of online culture and as digital forensic evidence for law enforcement, among other uses.

What Reddit’s move means

Reddit has previously raised concerns about the scraping of its content with the Internet Archive, according to The Verge. The non-profit was also reportedly notified before the web-crawling restrictions started going into effect yesterday.

The Internet Archive has yet to make an official statement about how it plans to respond to Reddit’s new restrictions, and at the time of writing, it has not responded to ZDNET’s request for comment. Wayback Machine director Mark Graham, however, has told multiple publications that the Internet Archive will “continue to have ongoing discussions about this matter” with Reddit.

Rising tension

Reddit’s reported decision to block the Wayback Machine from scraping the majority of its content arrives during a moment of mounting tension between AI companies and digital publishers, though Reddit is the first tech company to wade into the debate. The company sued Anthropic in June after discovering that the AI company was illegally scraping its data, but it has also previously signed licensing deals with both Google and OpenAI.

(Disclosure: Ziff Davis, ZDNET’s parent company, filed an April 2025 lawsuit against OpenAI, alleging it infringed Ziff Davis copyrights in training and operating its AI systems.)

AI developers require access to gargantuan troves of information to train generative AI models, which are designed to identify and replicate subtle mathematical patterns gleaned from those training datasets.

Many of those companies have scraped training data from publicly available websites, including social media sites and news outlets, claiming legal immunity under a concept known in copyright law as fair use. (The courts are still untangling the legitimacy of that argument, and will likely be doing so for some time.)

Many of the organizations whose content has been copiously scraped, along with a cohort of authors and other artists, have responded with lawsuits.

Others, meanwhile, have signed content licensing agreements with the likes of OpenAI, Anthropic, and Google, consenting to the use of their organizations’ data in exchange for increased visibility in the responses generated by chatbots, or other benefits.