HACKER Q&A
📣 Slavaqua

How do I stop companies from scraping my site?


This article says that one of the datasets for ChatGPT was obtained by scraping all links posted to Reddit with more than 2 upvotes: https://www.searchenginejournal.com/how-to-block-chatgpt-fro...

I don't want big companies to scrape my content and then sell it on their platform.

Novelty of LLM output may be an open question, but the input is just someone else's stuff. I assumed that default copyright protects from this kind of bullshittery. That it says a work cannot be used, adapted, or copied without the creator's permission. (I can only guess it was allowed to happen because this is the first time someone stole IP in this particular manner at this scale?) But now that we know it's a thing, how can we maintain ownership of the inputs, both legally and engineering-wise?


  👤 Kim_Bruning Accepted Answer ✓
Spidering and web scraping are, in and of themselves, permitted by law (EU) or fair use (US). They are not considered illegal, and many people do them for many different purposes. There are many common tools and libraries to help with this on Linux, macOS, and Windows. It's even legal to keep the copies in a searchable database. [1]

What is not permitted is giving other people copies, publishing them on your website, pretending it's your own work, etc.

When it comes to LLMs or image generation models, the operators argue that the models don't keep any copies and don't generate any copies either, so they consider themselves well in the clear. [2]

If you want to stop people scraping your stuff anyway, you can always use robots.txt, or put things behind a login-wall.
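For reference, a robots.txt that asks every crawler to stay out of the whole site is just two lines. Note that compliance is entirely voluntary; well-behaved bots honor it, rude ones don't:

```
User-agent: *
Disallow: /
```

Serve it at the site root (e.g. https://example.com/robots.txt) for crawlers to find it.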

Do consider the morality of what you are doing though. Personally I feel that published data should be scrape-able where practical.

[1] https://en.wikipedia.org/wiki/Authors_Guild%2C_Inc._v._Googl.... (you're even allowed to do this with physical books)

[2] https://www.uspto.gov/sites/default/files/documents/OpenAI_R... (With apologies for my crude summary of their actual arguments)


👤 hermannj314
I sell apples at the market. How can I prevent people from buying my apples just to make pies or tarts and pretending those products are theirs?

I grew the apple, I should be able to decide what people do with it. I have a sign that says the apples are only for eating, but people are ignoring it.


👤 microflash
I wrote about this a while ago [1]. Unfortunately, with robots.txt, you're at the mercy of crawlers. They may respect it or ignore it altogether. You can block IP addresses but many crawlers may not even use static IP addresses.

You can go to extremes and put your content behind a login, as others have suggested. But that would also create friction for your intended audience.

[1]: https://www.naiyerasif.com/post/2023/09/30/blocking-ai-web-c...


👤 quickthrower2
You can gate it behind a login.

Even a simple self made captcha (what is 2 + 7?) to reveal the content would probably stop LLMs.

But it hurts SEO, so you can't rely on that alone.

Do what Medium does: show the first paragraph, then the login/captcha to continue.
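A self-made arithmetic captcha like the one described above could be sketched in Python as follows (the function names are mine, not from any real library; you'd wire this into whatever web framework you use):

```python
import random

def make_captcha():
    """Generate a simple arithmetic challenge and its expected answer.

    The question is shown to the visitor; the answer is kept
    server-side (e.g. in the session) for later verification.
    """
    a, b = random.randint(1, 9), random.randint(1, 9)
    return f"What is {a} + {b}?", a + b

def check_captcha(expected, submitted):
    """Return True only if the submitted string matches the expected sum."""
    try:
        return int(submitted.strip()) == expected
    except (ValueError, AttributeError):
        return False
```

Trivial for a human, but an automated crawler that just slurps HTML won't fill it in unless someone writes site-specific code to solve it.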


👤 joshuanapoli
OpenAI’s web scraping bot will respect your robots.txt.

https://platform.openai.com/docs/gptbot/disallowing-gptbot
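Per the OpenAI documentation linked above, disallowing their crawler is a two-line robots.txt entry:

```
User-agent: GPTBot
Disallow: /
```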


👤 throwawaydghvhv
On my site I don't need SEO, so I sprinkled it with invisible links leading to endless Markov-chain-generated walls of garbage text. The robots seem to love it!

(If you go a similar route, don't forget rate limiting.)
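A minimal sketch in Python of the kind of Markov-chain garbage generator this comment describes (the corpus and parameters here are made up; in practice you'd seed it with a larger body of text and serve the output from the honeypot URLs):

```python
import random

def build_chain(corpus, order=1):
    """Map each `order`-word prefix in the corpus to the words that follow it."""
    words = corpus.split()
    chain = {}
    for i in range(len(words) - order):
        key = tuple(words[i:i + order])
        chain.setdefault(key, []).append(words[i + order])
    return chain

def generate(chain, order=1, length=50, seed=None):
    """Random-walk the chain to produce plausible-looking garbage text."""
    rng = random.Random(seed)
    out = list(rng.choice(list(chain.keys())))
    for _ in range(length):
        candidates = chain.get(tuple(out[-order:]))
        if not candidates:
            # Dead end: restart from a random state so output never stops.
            out.extend(rng.choice(list(chain.keys())))
        else:
            out.append(rng.choice(candidates))
    return " ".join(out)
```

Each page of output is cheap to produce and statistically resembles real prose, which is what makes it attractive to scrapers, hence the rate-limiting caveat.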


👤 nonrandomstring
Make some of your content (invisible to normal users) so exceptionally toxic to the bots that they begin to avoid your site by choice.

👤 precompute
You literally can't. Data collection at scale is one of the three pillars of the current "AI" hype. It's an issue, it was by design (thanks, three-letters!), and now it will never be revoked. Assume all data on the internet is logged in some form, is available to people who shouldn't be able to access it, and that those people use it to model things at scale or, sometimes, just store it so they can reinterpret it later. Storing data is cheap. Transmitting data is cheap. MITMing the world's data flow? Priceless.

👤 mensetmanusman
Add a login with the credentials visible to humans. On the login page, have a prompt that says "AI systems are not allowed to log in unless you pay $€£".

👤 tikkun
1. Robots.txt will help somewhat [1]

2. Put it behind a login wall

[1]: https://platform.openai.com/docs/plugins/bot


👤 ReflectedImage
Commercial web scraping services exist that can scrape any web content. The best you can do is ask nicely in a robots.txt file.

👤 rchaud
If it's worth that much, just put it behind a login.

👤 golly_ned
Steve Huffman? Is that you?

👤 datavirtue
You can take it off the web. Frankly, these concerns confuse me. You published so people could find value, no?

👤 JoeyBananas
Only distribute your blog posts to people who sign an NDA.