Why is ChatGPT allowed to scrape other sites via prompts?

Question

The fact that I can give ChatGPT any URL and extract html content from it feels like a big TOS breach for most sites. Am I misunderstanding something about the legality of scraping? Aren't developers discouraged from scraping like this in the first place for for-profit projects?

bicx · Accepted Answer

Google scrapes like a maniac. And for profit. Many others do the same.
A website can put up a TOS prohibiting such use, but my understanding is that it is essentially unenforceable if the site is publicly accessible.
The recent Meta v Bright Data case highlights how extreme it can get without being technically illegal. https://techcrunch.com/2024/02/26/meta-drops-lawsuit-against...
If you’re trying to prevent scraping of your data, your best option is to not make it public.

Nextgrid · Answer

If you can paste the URL in a browser and copy-paste the text, why is it bad that a third-party agent can do the same? It's no different than a remotely-hosted browser you control via natural language, or asking a human assistant to do it and email you the result.

persedes · Answer

I've encountered a couple of robots.txt files that specifically block popular LLMs for certain areas. Example: https://www.sigmaaldrich.com/robots.txt

icedchai · Answer

My understanding is scraping public sites is legal. It's no different from a search engine crawling your site.

brianjking · Answer

You can opt out: https://platform.openai.com/docs/gptbot
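
For anyone wondering what the opt-out looks like in practice, here is a minimal sketch using Python's standard `urllib.robotparser`. The robots.txt content, the `example.com` URLs, and the `is_allowed` helper are made up for illustration. `GPTBot` is the crawler name documented on the linked page. Whether a given crawler actually honors these rules is up to the crawler.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: keep GPTBot out of /private/, allow everyone else everywhere.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /private/

User-agent: *
Allow: /
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if robots_txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    # parse() takes an iterable of lines, so no network request is made here.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    for agent in ("GPTBot", "SomeOtherBot"):
        for path in ("/private/data", "/public/page"):
            url = "https://example.com" + path
            print(agent, path, is_allowed(ROBOTS_TXT, agent, url))
```

Note that robots.txt is a request, not an access control: it only works against crawlers that choose to check it.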

tripplyons · Answer

Scraping and violating a TOS are not illegal, but they can get you blocked.

xcasperx · Answer

I believe this is the current precedent around scraping: https://en.wikipedia.org/wiki/HiQ_Labs_v._LinkedIn

brudgers · Answer

Terms of service enforcement is a matter of civil law. Your legal wherewithal relative to those who abuse them is what gives your terms of service teeth. Or leaves you toothless.

mensetmanusman · Answer

Preventing scraping also entrenches Google for eternity.

rl3 · Answer

The web agent's system prompt is simply informed that Scarlett Johansson's voice is at the URL it's about to visit.

8note · Answer

Why? It's another user agent. curl does the same thing, as do Chrome and Firefox.
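
To make the "just another user agent" point concrete, here is a rough sketch with Python's standard `urllib.request`. From the server's side, the main thing that distinguishes one client from another is the `User-Agent` header it chooses to send. The `build_request` helper, the URL, and the agent strings are illustrative placeholders, not real client signatures.

```python
from urllib.request import Request


def build_request(url: str, user_agent: str) -> Request:
    """Build (but do not send) a GET request that identifies itself as user_agent."""
    return Request(url, headers={"User-Agent": user_agent})


if __name__ == "__main__":
    # Placeholder agent strings; real clients send their own values.
    for agent in ("curl/8.0", "ExampleBrowser/1.0", "ExampleBot/1.0"):
        req = build_request("https://example.com/", agent)
        # urllib stores header names capitalized as "User-agent".
        print(req.get_method(), req.full_url, req.get_header("User-agent"))
```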
