Can I stop OpenAI from using my website data as training data?
So OpenAI says that we can't use data generated by the OpenAI API to train competing models. Can I have a similar ToS for my website so that I don't allow OpenAI to train on my website data for future versions of GPT?
Correct me if I'm wrong, but there's no law saying you can't crawl a site that has a robots.txt configured; it's just a convention, and you wouldn't have any recourse telling a company it can't pull your info if it's on the public web.
As the answer would seem to be "no" (crawlers can ignore robots.txt rules), I'm wondering if GPT is going to usher in an era where new content is simply not made available to view or crawl on the web. E.g., want to know all the cool events happening this weekend in NYC? Ask our vertical GPT-chat site to find out.
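For what it's worth, OpenAI does publish a `GPTBot` user agent for its crawler, which can be disallowed in robots.txt. As noted above, this is purely a convention the crawler chooses to honor, not something you can enforce. A minimal sketch:

```
# robots.txt — honored voluntarily by compliant crawlers
User-agent: GPTBot
Disallow: /
```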
I want to allow it, but only if they also allow data generated by them to be used to train competing models (and data from those competing models to be used to train models competing with them too, and so on). And if someone wants to copy the data for another purpose (e.g. building a database, or making a movie), they cannot restrict that purpose either.
(I expect that the AI will still probably have similar kinds of problems to the ones it has now, regardless of whether or not this is allowed, though.)
Ultimately, whether web scraping for AI training data purposes is considered legal comes down to whether web scraping for ANY purpose is legal.
This isn't clear one way or the other, though recent rulings indicate that US courts are firmly poised to declare web scraping of public content fully legal. I.e., there is nothing you can do to stop it other than making your content non-public.
We were discussing this at my job a few days ago. Our legal team is still checking, but technically you can copyright your content and define a usage license for it. But if you write about common subjects or non-copyrightable material, you may be better off staying off the internet.
Use simple authentication headers
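Since the only reliable defense mentioned above is making content non-public, even a basic credential check on each request keeps anonymous crawlers out. A minimal WSGI sketch (the token name and value are made up for illustration; a real site would use proper session or OAuth handling):

```python
# Hypothetical shared secret for illustration only — use a real
# credential store and HTTPS in practice.
API_TOKEN = "change-me"

def app(environ, start_response):
    """Serve content only to requests carrying the expected
    Authorization header; anonymous crawlers get a 401."""
    if environ.get("HTTP_AUTHORIZATION") != f"Bearer {API_TOKEN}":
        start_response("401 Unauthorized", [("Content-Type", "text/plain")])
        return [b"Unauthorized"]
    start_response("200 OK", [("Content-Type", "text/plain")])
    return [b"Members-only content"]
```

You could mount this behind any WSGI server (e.g. `wsgiref.simple_server`); the point is just that the gate happens before any content is returned, so there is nothing public to scrape.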