Prompt: "king of belgium giving a speech to an audience, but the audience members are cucumbers"
All 4 results (none of them any good as far as the prompt is concerned): https://ibb.co/gz5RDkB
Full-size version of the one with the watermark: https://ibb.co/DzGR063
In the United States, two pieces of case law are widely cited and relevant. Kelly v. Arriba Soft Corp. (9th Cir.) found that making thumbnails of images for use in a search engine was sufficiently "transformative" to be fair use. Another case, Perfect 10 v. Amazon (9th Cir.), found that thumbnails for image search and cached pages were also transformative.
OTOH, cases like Infinity Broadcast Corp. v. Kirkwood found that retransmission of radio broadcasts over telephone lines is not transformative.
If I understand correctly, the US courts' fair-use test (transformativeness falls under the first factor) has four parts: (1) the purpose and character of the use, (2) the nature of the copyrighted work, (3) the amount and substantiality of the copying, and (4) the effect on the market.
I'd think that training a neural network on artwork, including copyrighted stock photos, is almost certainly transformative. However, as you show, a neural network can be overtrained on a specific image and reproduce it too closely; that output probably wouldn't fall under fair use.
There are also questions of whether they violated the CFAA or some agreement by crawling the images (though hiQ v. LinkedIn [0] makes it seem very possible to do legally), and whether they reproduced Getty's logo in a way that violates trademark (though are they using it in trade in a way that could cause confusion?).
When this is finally tried in court, if it fails to any meaningful extent at all (and it will doubtless be appealed all the way up to the supreme courts), then Copilot is dead, DALL·E is dead, GPT-3 is dead; all of these things will be immediately discontinued in at least the affected jurisdictions, at least until the laws are changed or the judgments overturned.
The dynamics in play are highly questionable. Countless artists and photographers put effort into creating their works. They put their work online to get some attention and recognition. A company comes along, scrapes all of it, and starts selling access to a model that generates something that looks highly derivative. The original cohort of artists and photographers not only gets zero money or attention from this new endeavor; they are now in competition with the resulting model.
In short, someone whose work was essential to building a thing gets no benefits and possibly even gets (financially) harmed by that thing. Just because this gets verbally labeled "fair use" doesn't make it fair.
Additional point:
Just a few years ago a bunch of tech companies were talking about "data dignity". Somehow, magically, this (marketing) term is no longer used anywhere.
Considering how strict and heavy-handed copyright enforcement has been otherwise, this has added to my belief that copyright in practice is really just the enforcement of the interests of whatever industry has the most power at a given time: when entertainment and content generation was the biggest revenue generator, copyright couldn't be strict enough; now all the money is in AI, and suddenly loopholes the size of barn doors pop up.
They aren't hosting the infringing content. Training on the data is probably covered under fair use. Generations are of _learned_ representations of the dataset, not the dataset itself. This makes it closer to outputting original works (probably owned by the person who used the model).
The players involved here are known for being litigious, however. I wouldn't be surprised if OpenAI did in fact pay some hefty fee upfront to get full permission to use these images.
https://www.reddit.com/r/KidsAreFuckingStupid/comments/8tgxs...
[0] https://cdn.ca9.uscourts.gov/datastore/opinions/2022/04/18/1...
BTW you can add 'royalty free' to the prompt to get rid of those most of the time (lol?).
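For what it's worth, a minimal sketch of applying that prompt trick programmatically, assuming access through the pre-1.0 OpenAI Python SDK's image endpoint (the prompt suffix, key placeholder, and parameters here are illustrative, not a documented workaround):

```python
import openai

openai.api_key = "sk-..."  # your API key

# Appending "royalty free" to nudge the model away from the
# stock-photo-style training images that tend to carry watermarks.
prompt = ("king of belgium giving a speech to an audience, "
          "but the audience members are cucumbers, royalty free")

response = openai.Image.create(prompt=prompt, n=4, size="1024x1024")
for image in response["data"]:
    print(image["url"])
```

No guarantee it works every time, of course; it just shifts which part of the training distribution the model samples from.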
That being said, arguments about copyright are just a fig leaf as far as I am concerned. The outcome of whether this is allowed or not will depend on the net impact of using those models on the job market and whether society will be willing to tolerate it.
You'll get a public link, at `labs.openai.com` rather than some random image-sharing site, which will show the image & the prompt used to generate it (including a credit to "your-first-name × DALL·E").
Say you were an artist who went to every art show and museum and studied all the art there.
If you produced a work of art solely from memory that contained large portions of other people's copyrighted art, would that still fall under copyright/require licensing?
We don't know what licensing has happened behind closed doors; my guess is that OpenAI has an agreement with Getty. Take a look at the image licensing in this Observer piece: it's been licensed from Getty, which would indicate that Getty is happy with the scraping.
https://www.theguardian.com/commentisfree/2022/aug/20/ai-art...
Besides, this is not infringement in principle; the AI has simply been trained to think that high-quality news images have watermarks.
If a company reverse engineers a competitor's product, they still buy the product to tear it apart and figure out how it works.
If a student learns from their teacher, then goes on to sell a similar kind of work as what their teacher makes, at least the student paid for the classes.
This arrangement offers none of that. As long as theft is illegal, this should be too. I'd call it parasitic, but that undersells it: this is a parasite whose sole intent is to kill the host.
You'd be surprised...
They probably already have specialized filtering models built to filter out censorable terms. They may be imperfect, but they are there. A watermark remover might be an easy addition.
When Stable Diffusion released their model playground, I used the prompt "Peter at the pearly gates dressed as a security guard" and got three images, two of which were censored and one that was an ordinary image. So the capability is there already; it's just a matter of time before they get good at watermark removal.
There are lots of photos with watermarks circulating on the web, for example in memes and on unfinished webpages (when finished, these will be replaced with the paid variant without the watermark).
BTW, Copilot also ignored all licenses of the source code it memorized.
Datasets are the new capital. If they could, most employees would probably also object to their company using the result of their work to replace their job. But they can't. It's the same with artists here.
Could be great for featured images for blog posts.
You are wasting CO2 even discussing it
The last time I checked was when Copilot went public; they could have trained it only on GPL code. The source license, copyright, and all the rest don't matter.
This makes me think back to the controversy over GitHub Copilot; if these AIs are going to be trained on other people's IP, then somebody needs to be held accountable when they commit plagiarism.
Otherwise, I'm sure Microsoft won't mind my new "game-maker AI" that I trained on that new Halo game last year, or this "OS AI" that I trained on Windows 11.
By the very meaning of artificial intelligence, we must accept that a mind, or an intelligence, is free to perceive external elements and use every stimulus to execute its own creative process.
The world is a perpetual iteration cycle amongst human beings. Good artists borrow, great artists steal.