HACKER Q&A
📣 srameshc

How do I save my content from AI crawlers?


I am building something where we (a very small team) are creating content the old-fashioned way, and each piece of content takes us a lot of effort. I want it to be publicly viewable, but I am equally worried about AI crawlers now. Is putting the content behind auth the only way, or are there other means?


  👤 austin-cheney Accepted Answer ✓
The only way to prevent crawlers from scraping your content is to make it unavailable to them. I am not suggesting any clever application or code trick. When I say unavailable I mean a separate network. Network isolation is ideal, but a VPN-based intranet also works.

You cannot have publicly accessible content and still limit it to parties of your choosing at your discretion. The content is either public or it's not, and obfuscation, even via authentication, is not a security solution. The role of authentication is not to limit access but to segment it, so that account-specific content is not intermingled between accounts.


👤 talldayo
> Is putting the content behind auth the only way

No, because I can still copy and redistribute your auth-locked content. By writing something and distributing it, you make it possible to train on its contents. That's the way things are now.


👤 amerkhalid
Add a reasonable fee in ToS. Something like $100 per article, and $5 per article per year for continued use in a model.

It may discourage some companies from scraping your content.

And if they ignore the license, you may have grounds to collect the fee if it becomes obvious that they are using your content in their AI models. (Unlikely you'd actually collect, though.)


👤 0xdeadbeefbabe
Disallow them in robots.txt. Visit https://[originating IP] and opt out. Email abuse@[owner of their IP]. It would be fun to distinguish an AI crawler from a regular crawler, btw.

👤 elevation
I wanted to serve content to a regional area while avoiding discovery by spammers and not wasting bandwidth on crawlers. I didn't need SEO, since I had a separate channel to tell users my site existed. At first I thought I'd blacklist the bots: no Google, no Bing, no OpenAI. But I realised that a whitelist would be much more effective.

The process I used was to gather the ASNs of every local ISP where my users are. Thanks to BGP looking glass services, you can convert this to a list of every advertised prefix. You can coalesce this into an optimised data structure using “ipset” to build a firewall rule that affects only addresses you whitelist.
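
Roughly, the prefix-gathering step looks like this as an untested sketch (it assumes the RIPEstat announced-prefixes endpoint, uses placeholder ASNs and a made-up set name, and only prints the ipset/iptables commands rather than applying them):

    # Sketch: turn local ISP ASNs into an ipset whitelist plus one firewall rule.
    # Assumes the RIPEstat "announced-prefixes" endpoint; any BGP looking glass
    # that lists advertised prefixes per ASN would work the same way.
    import json
    import urllib.request

    LOCAL_ISP_ASNS = ["AS64496", "AS64511"]  # placeholder ASNs, not real ISPs

    def announced_prefixes(asn):
        url = f"https://stat.ripe.net/data/announced-prefixes/data.json?resource={asn}"
        with urllib.request.urlopen(url) as resp:
            data = json.load(resp)
        return [p["prefix"] for p in data["data"]["prefixes"]]

    print("ipset create allowlist hash:net")
    for asn in LOCAL_ISP_ASNS:
        for prefix in announced_prefixes(asn):
            if ":" not in prefix:  # IPv4 only, to keep the sketch short
                print(f"ipset add allowlist {prefix}")

    # One rule keyed on the set, instead of thousands of per-prefix rules:
    print("iptables -A INPUT -p tcp --dport 443 -m set ! --match-set allowlist src -j DROP")

The nice part is that the firewall only ever evaluates a single rule; the set lookup is what scales.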

I also found a data center offering colocation which is directly peered with every major ISP in my area and a few of the minor ones.

Any visitor that doesn’t match my whitelist sees a decoy page. For the protected content, I set IP TTL just high enough to deliver only within the service area.

One drawback of this approach is that it's difficult to include local mobile users (without including their entire parent network), because they need higher TTLs and come from a network with many prefixes. I'd be interested in an iptables rule that could drop connections after a few packets based on an RTT heuristic.

This approach is the opposite of “zero-trust”, and it fails as soon as your threats start moving into your users' network.


👤 gabriel_dev
Perhaps add a challenge (CAPTCHA or similar) when the client's pattern looks malicious (requests per minute, average time on page, other stats)? This might not entirely block the crawler, but it still gives you data: you can block the source entirely if it's a cloud IP range, or at least degrade its crawling velocity. The other approach is to publicly display a degraded copy of the content and deliver the full, high-quality version after sign-in.

Overall, it's simply about making life a bit harder for the crawler (AI or plain old hardcoded one).
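
As a rough illustration of the rate side of that (the thresholds and the two handlers are placeholders, not a real anti-bot system):

    # Toy sliding-window counter: challenge an IP once it exceeds a per-minute budget.
    import time
    from collections import defaultdict, deque

    WINDOW_SECONDS = 60
    MAX_REQUESTS_PER_WINDOW = 30   # made-up threshold, tune to your traffic

    _hits = defaultdict(deque)

    def should_challenge(client_ip):
        """True once this IP has made too many requests in the last window."""
        now = time.monotonic()
        hits = _hits[client_ip]
        hits.append(now)
        while hits and now - hits[0] > WINDOW_SECONDS:
            hits.popleft()
        return len(hits) > MAX_REQUESTS_PER_WINDOW

    def handle_request(client_ip):
        if should_challenge(client_ip):
            return "captcha page"   # placeholder: serve a CAPTCHA or slowed response
        return "real content"       # placeholder: serve the normal page

Average time on page and similar signals would feed into the same decision; the point is just that the decision happens before the content is served.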


👤 lbhdc
https://darkvisitors.com/agents

I saw a SaaS recently that targets exactly this. They automatically update your robots.txt to cover new crawlers they find.

This will only stop crawlers that obey robots.txt though.
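
If you'd rather not depend on a service, a hand-maintained version of the same idea is tiny (the agent tokens below are just a few well-known examples, not a complete list, and they're not pulled from any API):

    # Regenerate robots.txt from a list of AI-crawler user-agent tokens.
    AI_AGENTS = ["GPTBot", "CCBot", "ClaudeBot", "Google-Extended", "PerplexityBot"]

    def build_robots_txt(agents):
        blocks = [f"User-agent: {agent}\nDisallow: /" for agent in agents]
        blocks.append("User-agent: *\nAllow: /")  # everyone else stays allowed
        return "\n\n".join(blocks) + "\n"

    with open("robots.txt", "w") as fh:
        fh.write(build_robots_txt(AI_AGENTS))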


👤 dakiol
Wouldn't it be possible to block IP ranges that belong to known cloud providers? Normal people browsing from home don't get such IP addresses assigned. You would also be blocking other kinds of visitors (scrapers and the like), but I guess that's a fair price to pay.
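
Something along these lines, for one provider (AWS publishes its ranges as JSON; other clouds publish similar files you'd merge in, and the example address is just illustrative):

    # Check whether a visitor IP falls inside a published cloud-provider range.
    import ipaddress
    import json
    import urllib.request

    AWS_RANGES_URL = "https://ip-ranges.amazonaws.com/ip-ranges.json"

    def load_aws_networks():
        with urllib.request.urlopen(AWS_RANGES_URL) as resp:
            data = json.load(resp)
        return [ipaddress.ip_network(p["ip_prefix"]) for p in data["prefixes"]]

    def is_cloud_ip(ip, networks):
        addr = ipaddress.ip_address(ip)
        return any(addr in net for net in networks)

    networks = load_aws_networks()
    print(is_cloud_ip("203.0.113.7", networks))  # documentation-range address, should be False

A linear scan is fine for testing the idea; in production you'd push the merged list into an ipset/nftables set or a radix tree.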

👤 meiraleal
I just thought of an architecture that could prevent access by AI crawlers while keeping the content public for anyone to read.

What would the demand be for something like this as a SaaS? Wondering if it would be worth developing.


👤 solardev
No surefire way, same as blocking any other poorly-behaved crawlers. You can use robots.txt, Cloudflare, CAPTCHAs, IP range blocks, etc., but at the end of the day none of those are foolproof. If someone really wants to scrape you, they will.

👤 JohnFen
I really don't think there is any solution to this right now. This problem makes the web unviable for me as a publication/distribution platform.

👤 Crier1002
I often scrape the web for fun (but in a careful way), so I can totally see how regularly changing the CSS class names in a site's HTML would break a bunch of the XPath/CSS selectors in a crawler. It would seriously be a nightmare for me if site owners could just flip a 'switch' and change the class names easily.
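
That 'switch' can be as simple as deriving class names from a salt you rotate each deploy, something like this sketch (the salt and class name are made up, and the stylesheet would need the same rewrite applied):

    # Derive per-deploy CSS class names by hashing the stable name with a salt.
    import hashlib

    DEPLOY_SALT = "2024-06-release"  # rotate this on every deploy

    def css_class(stable_name):
        digest = hashlib.sha256(f"{DEPLOY_SALT}:{stable_name}".encode()).hexdigest()
        return "c-" + digest[:8]

    print(css_class("article-body"))  # output changes whenever the salt changes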

👤 minimaxir
Hope companies obey robots.txt.