Can I stop OpenAI from using my website data as training data?
So OpenAI says that we can't use data generated by the OpenAI API to train competing models. Can I have a similar ToS for my website so that I don't allow OpenAI to train on my website data for future versions of GPT?
Correct me if I'm wrong, but there's no law saying you can't crawl a site that has a robots.txt configured; it's just a convention, and you wouldn't have any recourse telling a company it can't pull your info if it's on the public web.
As the answer would seem to be "no" (crawlers can ignore robots.txt rules), I'm wondering if GPT is going to usher in an era where new content is simply not made available to view or crawl on the web. E.g., want to know all the cool events happening this weekend in NYC? Ask our vertical GPT-chat site to find out.
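For what it's worth, OpenAI does publish a `GPTBot` user agent for its crawler, which can be disallowed in robots.txt. As noted above, this is purely a convention the crawler chooses to honor, not something you can enforce. A minimal sketch:

```
# robots.txt — honored voluntarily by compliant crawlers
User-agent: GPTBot
Disallow: /
```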
I want to allow it, but only if they also allow data generated by them to be used to train competing models (and data from those competing models to be used to train models competing with them too, and so on). And if someone wants to copy the data for another purpose (e.g. building a database, or making a movie), they cannot restrict that purpose either.
(I expect that the AI will still probably have similar kinds of problems to the ones it has now, regardless of whether or not this is allowed, though.)
Ultimately, whether web scraping for AI training data purposes is considered legal comes down to whether web scraping for ANY purpose is legal.
This isn't clear one way or the other, though recent rulings indicate that US courts are firmly poised to declare web scraping of public content fully legal. I.e., there is nothing you can do to stop it other than making your content non-public.
We were discussing this at my job a few days ago. Our legal team is still checking, but technically you can copyright your content and define a usage license for it. But if you write about common subjects or non-copyrightable material, you may be better off staying off the internet.
Use simple authentication headers
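Since the only reliable defense mentioned above is making content non-public, even a basic credential check on each request keeps anonymous crawlers out. A minimal WSGI sketch (the token name and value are made up for illustration; a real site would use proper session or OAuth handling):

```python
# Hypothetical shared secret for illustration only — use a real
# credential store and HTTPS in practice.
API_TOKEN = "change-me"

def app(environ, start_response):
    """Serve content only to requests carrying the expected
    Authorization header; anonymous crawlers get a 401."""
    if environ.get("HTTP_AUTHORIZATION") != f"Bearer {API_TOKEN}":
        start_response("401 Unauthorized", [("Content-Type", "text/plain")])
        return [b"Unauthorized"]
    start_response("200 OK", [("Content-Type", "text/plain")])
    return [b"Members-only content"]
```

You could mount this behind any WSGI server (e.g. `wsgiref.simple_server`); the point is just that the gate happens before any content is returned, so there is nothing public to scrape.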