How AI companies are secretly collecting training data from the web (and why it matters)


Like most people, my wife types a search into Google many times every day. We work from home, so our family room doubles as a conference room. Whenever we're in a meeting and a question about something comes up, she Googles it.

That much is the same as it has been for years. But what happens next has changed.

Instead of clicking on one of the search result links, she more often than not reads the AI summary. These days, she rarely clicks through to any of the sites that provide the original information Google's AI summarizes.

When I spoke to her about this, Denise acknowledged that she really does visit websites less frequently. But she also pointed out that, for topics where she's well-versed, she has noticed the AI is sometimes wrong. She said she takes the AI results with a grain of salt, but they often provide enough basic information that she needs to look no further. If in doubt, she does dig deeper.

In order that’s the place we’re at present. An increasing number of customers are like my spouse, getting information from the AI and by no means visiting web sites (and subsequently by no means giving content material creators an opportunity to be compensated for his or her work).

Worse, more and more people are trusting AI, so not only are they making it harder for content creators to make a living, but they're often getting hallucinated or incorrect information. Since they never visit the original sources, they have little impetus to cross-check or verify what they read.

The impact of AI scraping

Cloudflare CEO Matthew Prince offered some devastating statistics. His metric was the ratio of the number of pages crawled to the number of visitors sent back to the sites that published them.

As a baseline, he said that 10 years ago, for every two pages Google crawled, it sent one visitor to a content creator's website. Six months ago, that ratio was six pages crawled to one visitor sent to a content site. Now, just six months later, it's 18 pages crawled to one visitor sent.

The numbers, according to Prince, are far worse for AI companies, which derive substantial value from information they've scraped from the rest of us. Six months ago, the ratio of pages scraped to visitors referred by OpenAI was 250 to 1. Now, as people have become more accustomed to trusting the answers (or too lazy to care about inaccuracies), the ratio is 1,500 to 1.

In many ways, AI is becoming an existential threat to content creators. As it vacuums up content produced by hard-working teams all over the world, and then feeds that content back to readers as summaries, publishers and writers lose revenue and influence. Many creators are also losing motivation, because if they can't make a living doing it, or at least build a following, why bother?

Some publishers, like Ziff Davis (ZDNET's parent company) and the New York Times, are suing OpenAI for copyright infringement. You've probably seen the disclaimer on ZDNET that says, "Disclosure: Ziff Davis, ZDNET's parent company, filed an April 2025 lawsuit against OpenAI, alleging it infringed Ziff Davis copyrights in training and operating its AI systems."

Other publishers, including the Wall Street Journal, the Financial Times, the Atlantic, and the Washington Post, have licensed their content to OpenAI and other AI large language model makers.

The damage that AI intermediation can do to society as a whole is profound and worth an article all by itself. But this article is more practical. Here, we acknowledge the threat AI presents to publishing and focus on technical ways to fight back.

In other words, if the AIs can't scrape, they can't give away published and copyrighted content without publishers' permission.

Robots.txt: Your first defense

The simplest, most direct, and probably least effective defense is the robots.txt file. This is a file you place at the root of your website's directory. It tells spiders, crawlers, and bots whether they have permission to access your site. This approach is also referred to as user-agent filtering.

This file has a number of interesting implications. First, only well-behaved crawlers will pay attention to its directives. It doesn't provide any protection against access, so compliance is entirely voluntary on the part of the bots.

Second, you need to be careful about which bots you send away. For example, if you use robots.txt to deny access to Googlebot, your site won't get indexed for search on Google. Say goodbye to all Google referrals. But if you use robots.txt to deny access to Google-Extended, you'll block Gemini from indexing your site and using it for Gemini training.
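
As a rough sketch, a robots.txt that keeps regular Google Search indexing while opting out of some well-known AI training crawlers might look like the following. GPTBot, Google-Extended, and CCBot are user-agent tokens the respective operators document; check the current names before relying on them, and remember that compliance is voluntary.

    User-agent: GPTBot
    Disallow: /

    User-agent: Google-Extended
    Disallow: /

    User-agent: CCBot
    Disallow: /

    User-agent: *
    Allow: /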

There are sites that maintain an index of the bots you may want to deny access to, and OpenAI publishes its own guide on how to prevent its bots from crawling your site.

But what about web scrapers that ignore robots.txt? How do you prevent them from scraping your site?

How can you prevent rogue scraping?

This is where website operators need a belt-and-suspenders strategy. You're basically in an arms race: you're looking for ways to defend against scraping, while the scrapers are looking for ways to suck down all of your site's data. In this section, I'll list a few techniques. It's far from a complete list, and the methods change constantly, on the part of both defenders and scrapers.

Rate-limit requests: Adjust your server to limit how many pages can be requested by a given IP address in a given period of time. Humans aren't likely to request hundreds of pages per minute. This, like most of the techniques itemized in this section, will differ from server to server, so you'll have to look up how to configure the capability on yours. It might also annoy your site's visitors so much that they stop visiting. So, there's that.
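
If you'd rather prototype the idea at the application level before touching server configuration, here is a minimal sketch of a per-IP sliding-window limiter, assuming a single-process Flask app and an in-memory store. A production setup would do this in the web server or a shared store such as Redis, and the 120-requests-per-minute threshold is purely illustrative.

    import time
    from collections import defaultdict, deque

    from flask import Flask, abort, request

    app = Flask(__name__)

    WINDOW_SECONDS = 60
    MAX_REQUESTS = 120          # humans rarely request this many pages per minute
    hits = defaultdict(deque)   # IP address -> timestamps of recent requests

    @app.before_request
    def rate_limit():
        now = time.time()
        recent = hits[request.remote_addr]
        # Drop timestamps that have aged out of the window.
        while recent and now - recent[0] > WINDOW_SECONDS:
            recent.popleft()
        recent.append(now)
        if len(recent) > MAX_REQUESTS:
            abort(429)          # Too Many Requests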

Use CAPTCHAs: Keep in mind that CAPTCHAs tend to inconvenience users, but they can reduce some kinds of crawler access to your site. Of course, the irony is that if you're trying to block AI crawlers, it's the AIs that are most likely to be able to defeat the CAPTCHAs. So there's that.

Selective IP bans: If you find there are IP ranges that overwhelm your site with access requests, you can ban them at the firewall level. FireHOL (an open-source firewall toolset) maintains blocklists of IP addresses. Most of them are cybersecurity-related, but they can get you started on a block list. Be careful, though. Don't use blanket IP bans, or legitimate visitors will be locked out of your site. So, there's that, too.
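
For application-level checks against a downloaded blocklist (as a complement to, not a replacement for, firewall rules), a small sketch using Python's standard ipaddress module might look like this. It assumes a blocklist.txt file with one CIDR range per line, such as a FireHOL export.

    import ipaddress

    def load_blocklist(path="blocklist.txt"):
        """Read CIDR ranges, skipping blanks and comments."""
        networks = []
        with open(path) as f:
            for line in f:
                line = line.strip()
                if line and not line.startswith("#"):
                    networks.append(ipaddress.ip_network(line, strict=False))
        return networks

    def is_blocked(ip, networks):
        """Return True if the address falls inside any banned range."""
        address = ipaddress.ip_address(ip)
        return any(address in network for network in networks)

    # Example, assuming the documentation range 203.0.113.0/24 is in blocklist.txt:
    # is_blocked("203.0.113.7", load_blocklist())  ->  True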

The rise of anti-scraping services

There are a growing number of anti-scraping services that will attempt to defend your website for a fee. They include:

  • QRATOR: Network-layer filtering and DDoS-aware bot blocking
  • Cloudflare: Reputation tracking, fingerprinting, and behavioral analysis
  • Akamai Bot Manager: Identity, intent, and behavioral modeling
  • DataDome: Machine learning plus real-time response
  • HUMAN Security: JavaScript sensors with an AI backend
  • Kasada: Adaptive challenges and so-called tamper-proof JavaScript telemetry
  • Imperva: Threat intelligence plus browser fingerprinting
  • Fastly: Rule-based filtering with edge logic
  • Fingerprint: Cross-session fingerprinting and user tracking
  • Link11: Behavioral analysis and traffic sandboxing
  • Netacea: Intent-based detection and server-side analytics

Here's a quick overview of some of the techniques these services use.

Behavior matching: This technique analyzes more than headers; it analyzes request behavior. It's essentially a combination of header analysis and bot-by-bot request limiting.

JavaScript challenges: Beyond JavaScript-based CAPTCHAs, these often run in the background of a web page. They require scripts to execute, or measure the pacing of interaction on the page, before allowing further access.
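
To illustrate the basic idea in greatly simplified form, here is a sketch of a Flask app that refuses to serve content until a script has run in the visitor's browser and set a signed cookie. It assumes Flask and itsdangerous; commercial services layer far more telemetry on top, and a determined bot driving a headless browser would pass this trivially.

    import secrets

    from flask import Flask, render_template_string, request
    from itsdangerous import BadSignature, URLSafeSerializer

    app = Flask(__name__)
    signer = URLSafeSerializer(secrets.token_hex(16))

    CHALLENGE_PAGE = """
    <!doctype html>
    <script>
      // Clients that never execute JavaScript never receive this cookie.
      document.cookie = "js_pass={{ token }}; path=/";
      location.reload();
    </script>
    <noscript>Please enable JavaScript to view this site.</noscript>
    """

    @app.before_request
    def require_js_cookie():
        token = request.cookies.get("js_pass", "")
        try:
            if signer.loads(token) == "ok":
                return None     # valid cookie, let the request through
        except BadSignature:
            pass
        # No valid cookie yet: serve the challenge page instead of the content.
        return render_template_string(CHALLENGE_PAGE, token=signer.dumps("ok"))

    @app.route("/")
    def home():
        return "Protected article content."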

Honeypot traps: These are often elements buried in a web page, like invisible fields or links, that are designed to catch bots. If a bot grabs everything on a site (which a human user is unlikely to do), the honeypot trap recognizes it and initiates a server-side block.
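
As a toy illustration, here is a sketch of a Flask app with one invisible link that no human should ever click; any client that requests it gets its IP added to a ban list. The /totally-normal-page path and the in-memory ban set are invented for the example, and a real deployment would push bans out to the firewall or CDN.

    from flask import Flask, abort, request

    app = Flask(__name__)
    banned_ips = set()   # in production, push bans to the firewall or a shared store

    # Invisible to humans, but harvested by scrapers that grab every link.
    HIDDEN_LINK = '<a href="/totally-normal-page" rel="nofollow" style="display:none">archive</a>'

    @app.before_request
    def reject_banned():
        if request.remote_addr in banned_ips:
            abort(403)

    @app.route("/")
    def home():
        return "<html><body><p>Real content here.</p>" + HIDDEN_LINK + "</body></html>"

    @app.route("/totally-normal-page")
    def honeypot():
        # Only a client blindly following every link ends up here.
        banned_ips.add(request.remote_addr)
        abort(403)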

Overall behavioral analysis: This is where AIs are fighting AIs. AIs working on behalf of your website monitor access behavior and use machine learning to identify access patterns that aren't human. Those malicious accesses can then be blocked.

Browser fingerprinting: Browsers provide a wide range of data about themselves to the sites they access. Bots often attempt to spoof the fingerprints of legitimate users, but they frequently reveal their own fingerprints in the process, which blocking services can aggregate and then use to block the bots.
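
Full fingerprinting involves TLS, JavaScript, and hundreds of signals, but even a crude header-consistency check catches careless bots. Here is a sketch of a scoring function; the weights and thresholds are purely illustrative.

    def suspicion_score(headers):
        """Score a request's headers; higher means more bot-like."""
        score = 0
        lower_keys = {key.lower() for key in headers}
        user_agent = headers.get("User-Agent", "")

        if not user_agent:
            score += 3   # no User-Agent at all
        if "python-requests" in user_agent or "curl" in user_agent.lower():
            score += 3   # self-identified script clients
        if "accept-language" not in lower_keys:
            score += 1   # real browsers almost always send this
        if "Chrome" in user_agent and "sec-ch-ua" not in lower_keys:
            score += 2   # modern Chrome normally sends client hints

        return score     # e.g., challenge or block when the score reaches 3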

Decoy traps: These are mazes of decoy pages filled with autogenerated, useless content, linked together in a pattern that causes bots to waste their time or get stuck following links. Most of these use "nofollow" links, so search engines don't index them or let them hurt your SEO rank. Of course, malicious bots are learning how to identify these traps and counter them, but they do offer limited protection.
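
As an illustration of the concept, here is a sketch of a Flask route that serves an endless maze of junk pages, each linking to more junk. The /trap/ prefix and the gibberish generator are invented for the example; you would also disallow the prefix in robots.txt so legitimate search crawlers never wander in.

    import random
    import string

    from flask import Flask

    app = Flask(__name__)

    def gibberish(words=40):
        """Generate filler text that wastes a scraper's time."""
        return " ".join(
            "".join(random.choices(string.ascii_lowercase, k=7)) for _ in range(words)
        )

    @app.route("/trap/<page_id>")
    def trap(page_id):
        # Every decoy page links to five more decoy pages, forever.
        links = " ".join(
            f'<a href="/trap/{random.randrange(10**9)}" rel="nofollow">more</a>'
            for _ in range(5)
        )
        return (
            '<html><head><meta name="robots" content="noindex,nofollow"></head>'
            f"<body><p>{gibberish()}</p>{links}</body></html>"
        )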

The big trade-off of blocking scraping for AI training

As a writer who makes my living directly from my creative output, I find the prospect of AIs using my work as training data offensive. How dare a company like OpenAI make billions off the backs of all of us creatives, then turn around and offer a product that could potentially put many of us out of work?

And but, I’ve to acknowledge that AI has saved me time in many various methods. I exploit a textual content editor or a phrase processor day-after-day. However again after I began my profession, the publications I wrote for had typesetting operators who transformed my written phrases into publishable content material. Now, the running a blog instruments and content material administration methods try this work. A complete career vanished within the house of some years. Such is the worth of latest expertise.

I've been involved with AI innovation for decades. After writing about generative AI since it boomed in early 2023, I'm convinced it's here to stay.

AI chatbots like Google Gemini and ChatGPT are making token efforts to be good citizens. They scrape all our content and make billions off of it, but they're willing to provide links back to our work for the very few who bother to check sources.

Some of the big AI companies contend that they provide value back to publishers. An OpenAI spokesperson told Columbia Journalism Review, "We support publishers and creators by helping 400M weekly ChatGPT users discover quality content through summaries, quotes, clear links, and attribution."

Quoted in Digiday, David Carr, senior insights manager at data analytics firm Similarweb, said, "ChatGPT sent 243.8 million visits to 250 news and media websites in April 2025, up 98% from 123.2 million visits this January."

Those numbers sound big, but only without context. Google gets billions of visits a day, and before AI, nearly all of those visits resulted in referrals out to other sites. With Google's referral percentages dropping precipitously, and OpenAI's referrals amounting to a very small share of the traffic that would otherwise go to content producers, the problem is very real.

Sure, those links are mere table scraps, but should we block them? If you enable web scraping blocks on your site, will it do anything other than "cut off your nose to spite your face," as my mom used to say?

Unless every website blocks AI scrapers, effectively locking AI data sets to 2025 and earlier, blocking your own site from the AIs will accomplish little more than preventing you from getting what little traffic there is from the AI services. So should you?

In the long run, this practice of AI scraping is unsustainable. If AIs prevent creatives from deriving value from their hard work, the creatives won't have an incentive to keep creating. At that point, the quality of the AI-generated content will begin to decline. It will become a vicious circle, with fewer creatives able to monetize their skills and the AIs providing ever-worsening content quality.

So, what can we do about it? If we're to survive into the future, our entire industry needs to ask, and try to answer, that question. If not, welcome to Idiocracy.

What about you? Have you taken any steps to block AI bots from scraping your website? Are you concerned about how your content might be used to train generative models? Do you think the trade-off between visibility and protection is worth it? What kinds of tools or services, if any, are you using to monitor or limit scraping? Let us know in the comments below.


You can follow my day-to-day project updates on social media. Be sure to subscribe to my weekly update newsletter, and follow me on Twitter/X at @DavidGewirtz, on Facebook at Facebook.com/DavidGewirtz, on Instagram at Instagram.com/DavidGewirtz, on Bluesky at @DavidGewirtz.com, and on YouTube at YouTube.com/DavidGewirtzTV.
