> The model was trained using text databases from the internet. This included a whopping 570GB of data obtained from books, webtexts, Wikipedia, articles and other pieces of writing on the internet. To be even more exact, 300 billion words were fed into the system.
I believe it's unfair to these sources that ChatGPT drives away their clicks, and in turn the ad income that would come with them.
Scraping data seems fine in contexts where clicks aren't driven away from the very site the data was scraped from. But in ChatGPT's case, it seems really unfair to these sources and to the work the authors put in, as people would no longer even attempt to visit them.
Can this start breaking the ad-based model of the internet, where a lot of sites rely upon the ad income to run servers?
People get hostile at me online for this question, but it really is that simple. They've automated you, and that's definitely going to be a problem; but if it's acceptable for your brain to do the same thing, you're going to have to find a different angle of attack than "fairness".
1. You have a widely read spouse named Joe. He's got a good memory, and typically if you have a question you just ask him instead of searching for it yourself. Are you depriving Joe's sources of your eyeballs?
2. Many books summarize and restate other books. If I read Cliff's Notes on a book, for example, I can learn a lot about the original book without buying it. Is this depriving the author?
3. I have a website that proxies requests to other websites and summarizes them while stripping out ads.
So which of these examples is the better metaphor for what an LLM does?
I don't know. The fact is, LLMs are a new thing in our tech and culture and they don't quite fit into any of our existing cultural intuitions or norms. Of course it's ambiguous! But it's also exciting.
Yesterday: 1) You do research, you publish a book, you write some posts. 2) People discover your work and you personally, they visit your posts and subscribe to you. 3) You have an opportunity to upsell your book and make money on ads to sustain your future work; more importantly, you get to see traffic stats and see what is in demand, you get thank-you emails and feel valued.
Tomorrow: 1) You do research, write posts, publish a book. 2) It is all ingested by a for-profit LLM. 3) People ask the LLM for answers, and have no reason, or even opportunity, to buy your book or learn you exist.
What exactly are the incentives to publish information openly in that world?
(Will they even believe you if you say you're the one who did the niche research powering some specific ChatGPT answer, in a world where everyone knows you can just ask an LLM?)
Artists are already in full rebellion against this, as they should be: they are nearly eclipsed by AI, except when it comes to inventing new styles and hand-crafting samples for the models to train on. Those samples, I assume, are either scraped off the web or signed away under the unfair ToS of various online publishing platforms.
Since the damage is individually small (they took some code from me without attribution, OK) but collectively enormous, in my opinion it is the role of government to step in and soften the blow if necessary.
If you have a problem with ChatGPT's "scraped data", then you have more fundamental issues with how the internet is as it is today.
Please, people, learn how to focus your thoughts. Go read up on copyright law in the United States. If you go into learning about copyright law trying to justify your own preconceived notions you will gain nothing.
I'm actually not really sure I have an opinion on the ethics of it. Same argument as Adblock. You don't get to control how people consume your content if you put it out in the world for free. That goes for profiles, articles, Reddit posts, StackOverflow answers, etc. The only thing that's ironic is that large tech companies throw a fit whenever you want to turn the tables and scrape them.
For now, I have removed my existing works, both technical and creative, from the internet and won't be adding more while I try to work out what to do.
On the other hand, the focus on the potential of ChatGPT's natural language processing capabilities highlights the significance of learning and using LLMs (large language models) in data handling. LLMs could lead to a future where traditional databases become obsolete, replaced by advanced language models. As such, integrating LLMs into our daily lives and processes can bring many benefits and possibilities.
At some point participating in the internet means your stuff is going to be seen. I wear glasses to read web content. I don't think the glasses company should pay royalties for what I read. ChatGPT is a tool that allows me to understand and use the information people put onto the internet better.
Far from a matter of fairness, this is simply another way that selfish people are trying to monetize the future, to make it more and more difficult and expensive for others to participate.
"I've always wished I could charge everyone on earth. ChatGPT looks like the future. If I can tap the money flow there I will get mo' money."
I'm against it.
Maybe it was unfair to telephone operators when connection automation was implemented, since it made operators obsolete, but the older model couldn't scale, the same way reading text from the source doesn't scale for human productivity.
E.g. Summary of How to Win Friends and Influence People: Effective Steps to Better Interpersonal Relationships by Book Lyte
ChatGPT does more of a mashup with the learned data than humans need to; that'll do me.
We can only hope. It's unfair to someone that my browser can ask your server for a page, I see an ad for random bullshit nobody would ever care about, money changes hands behind the scenes, and that counts as an economic transaction which boosts GDP. It's unfair (in my favour) that I can piggyback off this to get things for free.
And when I say "someone" I suspect "everyone". Sadly, spending money advertising "Yorkshire woman finds guaranteed way to win on the horses" doesn't seem to have caused anyone to run out of money and the whole thing to collapse yet. And it's unfair on real small businesses with products, paying for adverts which people don't see, or which are clicked by bots, or which are misreported, while all they can do is throw money at Google and Facebook and hope.
Clearly, ownership of ideas runs out, because we all use linked lists or binary trees, or paper, or turbines or the list goes on. We don't pay money to the inventors of linked lists, or the heirs or successors-in-interest to the inventor of paper. Why not? When does ownership of an idea expire? Why do we unconsciously accept copyright or patent limits of today?
There's also an issue with simultaneous invention, but that's out of scope here. Clearly ChatGPT is just regurgitating or otherwise emitting previously-ingested material.
Discussion is pointless because everyone already has an opinion and it's very firm.
Google has been doing this in search results for years, and so does Bing. Apple also does this in its built-in dictionary.
Why rant about ChatGPT, which, at least currently, is a small company in comparison?
question: How could the people who generate content used in an AI language model be paid for their work?
answer: There are several ways in which the people who generate content for an AI language model could be paid for their work:
Royalty-based payment: Content creators could receive a percentage of the revenue generated from the use of their content in the AI language model.
Token-based payment: If the AI language model is built on a blockchain, content creators could be paid in tokens that could be traded for cryptocurrency or fiat currency.
Partnership with content publishers: The developers of the AI language model could partner with content publishers to compensate the creators of the scraped content.
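The royalty-based option above can be sketched concretely. A minimal pro-rata split, assuming the operator could attribute which sources contributed to answers (a big assumption in practice); the creator names, the 30% pool, and the usage counts below are all made up for illustration:

```python
def royalty_split(revenue, usage_counts, creator_share=0.30):
    """Split a creator pool pro rata by attributed usage.

    revenue: total revenue for the period.
    usage_counts: dict mapping creator -> number of attributed uses.
    creator_share: fraction of revenue set aside for creators.
    """
    pool = revenue * creator_share
    total = sum(usage_counts.values())
    if total == 0:
        # No attributed usage this period: nothing to pay out.
        return {creator: 0.0 for creator in usage_counts}
    return {creator: pool * n / total for creator, n in usage_counts.items()}

# Hypothetical numbers for illustration only.
payouts = royalty_split(1_000_000, {"nyt": 500, "wikipedia": 300, "blogger": 200})
# payouts: {"nyt": 150000.0, "wikipedia": 90000.0, "blogger": 60000.0}
```

The hard part is not the arithmetic but the attribution: an LLM does not record which training documents produced a given answer, which is why this stays a sketch.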
2 - Code was trained from GitHub. GitHub is Microsoft. OpenAI is Microsoft money. So Microsoft trained its AI on Microsoft code. You disagree? Then GTFO from GitHub and don't feed Microsoft your code anymore.
3 (the most important point) - Q: "Can this start breaking the ad-based model of the internet, where a lot of sites rely upon the ad income to run servers?"
Fuck YEAH!! Please do so. I hope that shit show of an ad model crashes and burns to the ground. You can't use the internet without solid armor on: uBlock Origin and/or NoScript (or Pi-hole if you want the same readable experience on the rest of your household devices).
Hopefully. This would be the best outcome I can think of for the Internet.
Obviously storage is not a major factor here.
> What's the New York Times scrambled egg recipe?
GPT returns the exact recipe. If I were NYT I'd be frustrated. Their content is now showing without the ad views or paywall.
Is there something analogous to saliency maps for LLM?
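There is a rough analogue: occlusion-based attribution, where you drop each input token and measure how much the model's score changes. A toy, model-agnostic sketch; the `toy_score` function and its word list below are stand-ins I made up, not a real LLM:

```python
def occlusion_saliency(tokens, score_fn):
    """Saliency of token i = score(full input) - score(input without token i)."""
    base = score_fn(tokens)
    return [base - score_fn(tokens[:i] + tokens[i + 1:])
            for i in range(len(tokens))]

# Toy stand-in for a model score: counts "recipe-like" words.
RECIPE_WORDS = {"egg", "butter", "whisk"}

def toy_score(tokens):
    return sum(1.0 for t in tokens if t in RECIPE_WORDS)

sal = occlusion_saliency(["whisk", "the", "egg", "gently"], toy_score)
# "whisk" and "egg" get saliency 1.0; "the" and "gently" get 0.0
```

The same recipe works with a real LLM by using the model's log-probability of its answer as `score_fn`, at the cost of one forward pass per token; gradient-based methods (gradient x input, integrated gradients) are the cheaper alternatives the interpretability literature uses.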
1) AI is open sourced and we adapt stably. Either everybody has the opportunity to be their own business, or there is UBI.
2) AI is open sourced but it is unfairly distributed. Only some people are suited to BTOB, and/or UBI is shit.
3) AI is not open sourced, the wealthy edge out mankind and a planet scale genocide occurs.
4) none of it matters because the looming war between the US & China explodes or climate change wipes us out in any meaningful capacity that could pursue AI.
Given the track record of our species, #1 feels like wishful thinking.
It’s really just building a better model.
Will LLMs drive interest/activity away from wikipedia.org? Will it put its own sources of high-quality ad-supported content -- wikihow.com, for example (though I can't be totally sure it scraped from there) -- out of business? Or is there an earth-shattering copyright suit against OpenAI in the works as we speak?
> Can this start breaking the ad-based model of the internet
Is the alternative that everything is behind some kind of paywall by default, to block scraping? Is that where we're heading?
"Copyright" "ingenuity of thought" etc are concepts that need to be overhauled since a lot more people now have access to higher education.
How could training an AI on the works of someone who has already been paid for them be unfair? Possibly because it affects their future marketability and income.
Current authors, artists, and internet commenters clearly have a stake in the results of their creative endeavors being used for gain that they won't benefit from. This is very similar to the extractive monopolies of YouTube and the rest of social media: their profit at our expense.