HACKER Q&A
📣 albert_e

Will AI-generated images flooding the web pollute future training data?


We are seeing tons of AI-generated images from DALL-E, Stable Diffusion, Midjourney, etc. flooding the internet.

This will only increase.

Not every such image has a distinct mark that denotes it as AI-generated. They could be mistaken for a real photograph or a real work of (digital) art by a human. Especially by an algorithm.

I also understand that a lot of today's cutting-edge models are trained on images scraped from the web. I'm not sure what curation happens, but it cannot be foolproof.

Will future AI models that generate "realistic" images feed on this as input and generate images that mimic some of these attributes -- creating some kind of feedback loop that will echo through generations of models?

Has anyone already thought about such issues -- not just with images but with AI-generated text, data, music, etc.?

Curious to know what the thinking of this group is.


  👤 pjc50 Accepted Answer ✓
Yes - and not only that, but AI-generated art will start affecting how humans make art, and possibly even how they take photos. I wouldn't like to predict how, though.

Especially with text there's an arms race to make undetectable AI text for blogspam and similar purposes. It's going to end up like carbon dating: once nuclear weapons were detonated in the atmosphere, everything ended up contaminated and had to be accounted for. https://www.radiocarbon.com/carbon-dating-bomb-carbon.htm

The future will include humans claiming AI art as their own, possibly touched up a bit, and AIs claiming human art as their own.


👤 jcims
I'm more worried about us becoming numb to visual novelty.

I was tinkering with Stable Diffusion yesterday to come up with ideas for an apartment interior design. These aren't even cherry-picked; you can generate images like this, one every 10 seconds or so, for as long as you like:

https://imgur.com/a/mczYfnv


👤 iammjm
I am worried about information dumping -- just flooding the Internet with an insane amount of AI-generated data to drown out the "real" data. And by "real" I don't necessarily mean genuine or man-made -- the flood can be outright fake or just slightly divergent. Imagine a world where only 1 out of 1000 news stories or tweets is real... the amount of damage you can do to institutions, democracies, causes... just keep fucking with people to the point where they have no idea what is real and what is fake, give up on truth-seeking altogether, and give in to the loudest/most dominant narrative. Invest money into an AI farm that spits out a fake every second and see if the fact-checkers and well-researched alternatives can keep up.

👤 sinecure
The most valuable thing online in the next few decades will be authenticity. Authenticity is really the new luxury. The beauty of the early internet was its pure, passion-driven authenticity. Websites sprouting up for every interest, built only because someone was driven to share their thoughts on a given subject. Forums filled with techies chatting about their interests. Video games exploring interactive media and forming a new art. The rise of memes from places like 4chan, which have come to dominate digital expression.

All of these beautiful things have been degraded by the inauthentic, focus-grouped, data-harvesting advertising machines of mega-corporate greed. Unique websites and blogs are drowned into oblivion, unprofitable and hidden by the SEO gods of Google, which funnel you into their own products and advertising pathways. Forums bled out into Reddit, which is now an astroturfed corporate dream world where advertisers can masquerade as real users and corporate-appointed moderators funnel all conversation into the optimum advertising framework -- deleting anything that could harm Reddit's shareholder pool of giant corporations and governments. Video games went from novel, artistic experiments to hyper-optimized addiction machines built to drain the time, money, and drive from their young audience. Even memes, with all their raw vulgarity and juvenile silliness, have been co-opted by corporations trying to bend this new form of expression to their advertising goals.

First, content online was authentic and human. Then, big tech started trimming and censoring and funneling and optimizing it into something less real... less human, but far more ripe for advertising revenue and data collection. Now, we are entering the stage of AI-generated content. Articles written by algorithm, art created by machine, bots filling up the whole internet with noise. The level of distrust, paranoia and questioning of reality that users will experience online in the coming years will be unparalleled. Is this image real? Is this person I'm talking to a bot? Is this artwork human made?

Which brings me back to my main point. Authenticity will be the new luxury. And the builders of tomorrow who figure out how to curate authentic online communities and experiences will be the winners in this content war.


👤 bee_rider
Perhaps someone will figure out (or maybe it has already been done) a business model for selling access to curated datasets that are known not to include a bunch of additional ML-generated noise.

Although, to some extent I wonder how much it matters. If we're creating images using AI tools and then sharing the best results, doesn't that become valid training data? In some sense, aren't we supervising the learning?


👤 pengstrom
Of course, as with all human cultural endeavours. The images that get published are also selected by human aesthetics and interest, so valuable novel information is still being propagated.

👤 CTDOCodebases
I think the best analogy is hip-hop music.

Everything will become a copy of a copy of a copy until it all sounds the same and looks like a parody of itself, lacking the soul that made it attractive in the first place.


👤 liveoneggs
Maybe everyone should stop mass-stealing media to "train their model"

👤 verdverm
This reminds me of Google's paper titled "Machine Learning: The High Interest Credit Card of Technical Debt"

https://research.google/pubs/pub43146/

Abstract: Machine learning offers a fantastically powerful toolkit for building complex systems quickly. This paper argues that it is dangerous to think of these quick wins as coming for free. Using the framework of technical debt, we note that it is remarkably easy to incur massive ongoing maintenance costs at the system level when applying machine learning. The goal of this paper is to highlight several machine-learning-specific risk factors and design patterns to be avoided or refactored where possible. These include boundary erosion, entanglement, hidden feedback loops, undeclared consumers, data dependencies, changes in the external world, and a variety of system-level anti-patterns.


👤 seqi
If a human cannot distinguish between a real image and an AI-generated image, and the AI-generated images are manually labelled, then is there really a problem?

👤 sebastianconcpt
It's a great question, and it brings me to two follow-up questions:

1. The actual pollution happens in culture (our collective imaginary) and, as history shows, censorship (cultural via cancellation, or legal via politicization of the issue) is neither a moral nor a practical solution. So: high culture and filtering technology to the rescue?

2. Images are just the start, as Murphy's Law ensures we'll face this same problem for every categorizable piece of knowledge you can think of using in an AI artifact (music, patterns of movement, speech recognition, behavior recognition, art recognition, etc.).


👤 alenrozac
Possibly, but it should be easy enough to exclude.

For example, DALL-E adds a pixel watermark (bottom right), and I assume an indicator could also be hidden in the image data itself. One could also exclude common meme formats and their derivatives. Then there's the option of mapping back to known generated content via hashing, a la Shazam, or having a discriminator component, etc.

But you're right, it's not trivial. I just don't think it's too big of a deal.
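
A minimal sketch of the hashing idea, assuming some registry of known generated images exists (the file names and the Hamming threshold of 5 are placeholders, not tuned values):

    # Average-hash ("aHash") fingerprinting with Pillow: downscale to 8x8
    # grayscale, threshold each pixel against the mean, and compare the
    # resulting 64-bit fingerprints by Hamming distance.
    from PIL import Image

    def average_hash(path, size=8):
        img = Image.open(path).convert("L").resize((size, size))
        pixels = list(img.getdata())
        mean = sum(pixels) / len(pixels)
        bits = 0
        for p in pixels:
            bits = (bits << 1) | (1 if p > mean else 0)
        return bits

    def hamming(a, b):
        return bin(a ^ b).count("1")

    # Hypothetical registry of hashes of images known to be AI-generated.
    known_generated = {average_hash("dalle_output.png")}

    candidate = average_hash("scraped_image.jpg")
    if any(hamming(candidate, h) <= 5 for h in known_generated):
        print("likely a known generated image (or near-duplicate)")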


👤 peaslock
While automatic data collection might become harder, one can still curate high-quality datasets. An example is Tesla's FSD autopilot, which (AFAIK) is trained almost entirely on curated data, as well as on highly realistic 3D simulation data. Sure, it is expensive, but the expected returns are also very high.

There is some evidence that NNs are currently somewhat limited by the availability of high-quality data.[0] However, I'm not sure this is really a problem, because neural nets already accomplish amazing things, so one might not need that much data to get something useful (perhaps at the expense of more compute, but so what; e.g. analog computing might give some 1000x speedup anyhow).

[0] https://www.lesswrong.com/posts/6Fpvch8RR29qLEWNH/chinchilla...
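
As a back-of-the-envelope version of the Chinchilla point above: the paper's compute-optimal fits work out to roughly 20 training tokens per model parameter. The 20x ratio is a commonly cited rule of thumb from those fits, not an exact law:

    # Rough Chinchilla-style estimate: ~20 training tokens per parameter.
    def chinchilla_optimal_tokens(params, tokens_per_param=20):
        return params * tokens_per_param

    for params in (1e9, 70e9, 175e9):
        tokens = chinchilla_optimal_tokens(params)
        print(f"{params / 1e9:6.0f}B params -> ~{tokens / 1e12:.2f}T tokens")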


👤 shmatt
One thing I've noticed is that no AI tooling company is building a good, easy-to-use feedback loop. A high-quality model needs a lot of human feedback, but no one is really building a strong platform for collecting it. Eventually someone will have to fill that gap.
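
To make the gap concrete, here is a hedged sketch of the minimal plumbing such a platform would need: log (prompt, output, rating) tuples somewhere durable for later fine-tuning. The file name and rating scheme are made up:

    # Append human feedback as JSONL records for later use as training signal.
    import json, time

    FEEDBACK_LOG = "feedback.jsonl"  # hypothetical storage location

    def record_feedback(prompt, output_ref, rating):
        # rating: +1 (good) / -1 (bad) from a human reviewer
        entry = {"ts": time.time(), "prompt": prompt,
                 "output": output_ref, "rating": rating}
        with open(FEEDBACK_LOG, "a") as f:
            f.write(json.dumps(entry) + "\n")

    record_feedback("a cozy apartment interior", "img_0042.png", +1)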

👤 raphar
With all these AI-generated images around, the one thing I don't want is younger people quitting drawing: ditching the pencil for an app, losing the motivation for hand drawing because these programs are easier and faster. That would be a sad world.

👤 CarbonCycles
I think it's going to become a cat-and-mouse game. AI-generated images (e.g., deep fakes) are already being used in very nefarious ways, such as in job interviews, applying for gov't documents via video, etc.

Researchers are finding ways to identify the telltale markers that currently give them away, but yes, for the neophyte this is going to be a real issue of knowing what to trust.

However, the great thing is that you will always have data... the challenge will then become how much to TRUST your predictions, which I believe will spur some very interesting algorithms, such as anomaly detection (e.g., flagging an image whose RGB distributions, spatial markers, etc. are way too distorted compared with metadata from other pictures of its type).
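
A toy sketch of that anomaly-detection idea: compare a candidate image's per-channel color histograms against reference statistics built from trusted photos. The file names and the 0.1 threshold are illustrative, not tuned:

    # Flag an image whose per-channel color histograms sit far from a
    # reference distribution built from trusted photos.
    # Requires Pillow and NumPy.
    import numpy as np
    from PIL import Image

    def channel_histograms(path, bins=32):
        arr = np.asarray(Image.open(path).convert("RGB"), dtype=np.float64)
        hists = []
        for c in range(3):  # R, G, B
            h, _ = np.histogram(arr[..., c], bins=bins,
                                range=(0, 256), density=True)
            hists.append(h)
        return np.concatenate(hists)

    def chi_square(p, q, eps=1e-12):
        return 0.5 * np.sum((p - q) ** 2 / (p + q + eps))

    # Reference built by averaging histograms over a trusted photo set.
    reference = np.mean([channel_histograms(p)
                         for p in ["real1.jpg", "real2.jpg"]], axis=0)
    score = chi_square(channel_histograms("suspect.png"), reference)
    print("anomalous" if score > 0.1 else "plausible", f"(chi2={score:.3f})")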


👤 sebastien_b
I’m hopeful it will ruin the business model of privacy rapists like Clearview AI.

👤 dougmwne
First, it's probably quite possible to train a network to filter AI-generated from human-made images. There are many signs visible even to the untrained eye, and probably many more that are invisible to humans but can be picked up from correlations.

Secondly, there is still a big human selection process going on: only the most interesting and coherent images will find their way onto the public internet. In fact, if you can automatically detect that these images are AI-generated, they can serve as an additional training signal to help teach the AI which of its outputs are most likely to delight humans.
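
A hedged sketch of the first point in PyTorch: a small CNN trained on two folders of images, data/ai and data/human. The folder layout, image size, and hyperparameters are all assumptions, and a serious detector would need a far better architecture and dataset:

    # Binary detector sketch: predict whether an image is AI-generated
    # or human-made from folders data/ai and data/human.
    import torch
    import torch.nn as nn
    from torchvision import datasets, transforms

    tfm = transforms.Compose([transforms.Resize((128, 128)),
                              transforms.ToTensor()])
    train_set = datasets.ImageFolder("data", transform=tfm)  # ai/, human/
    loader = torch.utils.data.DataLoader(train_set, batch_size=32,
                                         shuffle=True)

    model = nn.Sequential(
        nn.Conv2d(3, 16, 3, stride=2, padding=1), nn.ReLU(),
        nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU(),
        nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        nn.Linear(32, 2),  # logits: [ai, human] (alphabetical folder order)
    )
    opt = torch.optim.Adam(model.parameters(), lr=1e-3)
    loss_fn = nn.CrossEntropyLoss()

    for epoch in range(3):
        for x, y in loader:
            opt.zero_grad()
            loss = loss_fn(model(x), y)
            loss.backward()
            opt.step()
        print(f"epoch {epoch}: loss {loss.item():.3f}")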


👤 colmanhumphrey
It'll make some naive approaches work differently, but overall it's more information: both the selection effect of which images humans share, and the surrounding context (e.g. the comments people make about the image).

👤 eru
> They could be mistaken for real photograph or real work of (digital) art by a human. Especially by an algorithm.

Perhaps by a naive algorithm. I'm fairly sure it's easy to train a neural network to recognize current-generation AI-generated images, and that will probably remain true for quite a while longer.

Btw, if you want to create realistic images, it's fairly easy to create guaranteed-pristine data: just take a video camera and shoot some footage.

Perhaps there will even be a market for such pristine data.

Now, if you want to create art and train on human artists' output, that might perhaps get harder in the future.


👤 jacknews
LOL, Malkovich ... Malkovich, Malkovich.

https://www.youtube.com/watch?v=Q6Fuxkinhug

Starring Malkovich, as, Malkovich.


👤 WithinReason
It should be easy to train a discriminator that can tell the difference well enough. And if it misclassifies an image, then using that image for training should be fine :)

👤 benreesman
I'll pose a mental model which is more than a bit simplistic, but I've watched big models in feedback loops before, and I feel the intuition is fundamentally sound.

There is a sort of fixed point in this stuff that creates a Nash equilibrium. Every relevant move is a move for some advantage (from megacorp copyright laundering to aspiring-influencer content output), and that competition tends to wash out roughly where you started.


👤 bluelightning2k
I think it's likely that AI can recognize AI generated images.

This would mean it's always likely to be filterable.

And if not, it's arguable that this pollution becomes an asset: it would be high-quality synthetic training data, which is already commonly used on purpose.

It would also be possible to look for the metadata that accompanies photos taken on phones etc. and weigh that more highly.
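
That metadata check is easy to prototype with Pillow: real phone and camera photos usually carry EXIF tags (camera make, model, timestamp), while generated files usually don't. EXIF is trivially forged, though, so at best this is a weak positive signal:

    # Look for camera EXIF metadata as weak evidence of a real photo.
    from PIL import Image
    from PIL.ExifTags import TAGS

    def camera_evidence(path):
        exif = Image.open(path).getexif()
        tags = {TAGS.get(k, k): v for k, v in exif.items()}
        return {t: tags[t] for t in ("Make", "Model", "DateTime") if t in tags}

    evidence = camera_evidence("photo.jpg")
    print("camera metadata found:" if evidence else "no camera metadata",
          evidence)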


👤 lazyjones
AI-generated media in general will likely plunge us into a new dark age. Every report you see on the news, every "secret recording" of a politician doing something dodgy, will either be an AI-generated fake or be dismissed as one by many viewers. Nothing will be certain anymore, and the MSM have already lost their credibility with many viewers.

👤 t_mann
Probably, but I don't think it'll be a huge problem. Firstly, we want to be able to identify generated images for much more important reasons anyway, and secondly, the importance of training data might decrease as those systems mature and move to less supervised training methods.

👤 Kuinox
People will post more of the generated images they consider good than ones they consider bad, thus improving the dataset.

👤 Cthulhu_
I'm sure it has been done before, but has anyone ever generated a set of machine-learning-generated anythings (images, text, whatever) and used it as the training set for a new ML model?

Then again, and again, and again, until we end up in the 10th dimension of AI surrealism.
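
Here is a toy simulation of exactly that recursion, using a 1-D Gaussian instead of an image model: fit the data, sample a new dataset from the fit, refit, repeat. Each refit adds sampling noise and a slight downward bias to sigma, so over many generations the fitted distribution tends to drift and narrow -- a cartoon of what repeated self-training can do:

    # Fit -> sample -> refit loop: a cartoon of training each model
    # generation on the previous generation's outputs.
    import numpy as np

    rng = np.random.default_rng(0)
    data = rng.normal(loc=0.0, scale=1.0, size=100)  # the "real" data

    for gen in range(31):
        mu, sigma = data.mean(), data.std()
        if gen % 5 == 0:
            print(f"gen {gen:2d}: mu={mu:+.3f}, sigma={sigma:.3f}")
        data = rng.normal(mu, sigma, size=100)  # next "model" trains on samples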


👤 warrenm
I look forward to the "pollution", personally

This is not meant to denigrate human artists, photographers, etc.

I think it'll drive artists to make better/different work (just like the advent/adoption of the vanishing point changed art ~500-1000 years ago)


👤 XorNot
I don't think so -- they'll probably enhance it. An AI draws something, people describe what they actually see, and the AI is thereby refined in what it should produce for a given prompt (and moves away from the original prompt).

👤 taubek
Yes, it will pollute the training data from today's perspective. In the future, I don't think people will care. It is a shame that there is no metadata embedded in AI-generated images so that they can be distinguished.

👤 kybernetyk
No, just create an image service like Getty or Shutterstock where human-made images are uploaded for the sole purpose of selling them to AI companies that want to train their networks on organic material.

👤 steve_john
AI algorithms can generate images or videos based on a set of parameters, or they can create new images by combining and altering existing ones.

👤 tarunmuvvala
I am curious: how do these AIs source their data?

Is it the free-floating images on the internet, or is it the data that we keep on the servers of Big Tech?


👤 seqi
If a human would label the AI-generated image exactly the same as if it were a real image, then is there really a problem?

👤 zecg
I think each AI-generated image should carry a steganographic signature with data on the system that generated it.
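
A toy version of that idea, hiding a provenance string in the least-significant bits of the red channel with Pillow. A real scheme would need a watermark that survives resizing and recompression, which plain LSB embedding does not; the file names and message here are placeholders:

    # LSB steganography sketch: embed/extract a NUL-terminated provenance
    # string in the red channel's least-significant bits.
    from PIL import Image

    def embed(path_in, path_out, message):
        img = Image.open(path_in).convert("RGB")
        bits = "".join(f"{b:08b}" for b in message.encode()) + "0" * 8
        px = img.load()
        w, h = img.size
        assert len(bits) <= w * h, "image too small for message"
        for i, bit in enumerate(bits):
            x, y = i % w, i // w
            r, g, b = px[x, y]
            px[x, y] = ((r & ~1) | int(bit), g, b)
        img.save(path_out, "PNG")  # must be lossless, or the bits are lost

    def extract(path):
        img = Image.open(path).convert("RGB")
        px = img.load()
        w, h = img.size
        out, byte = bytearray(), 0
        for i in range(w * h):
            byte = (byte << 1) | (px[i % w, i // w][0] & 1)
            if i % 8 == 7:
                if byte == 0:
                    break  # hit the NUL terminator
                out.append(byte)
                byte = 0
        return out.decode(errors="replace")

    embed("generated.png", "signed.png", "model=foo-v2;date=2022-10-01")
    print(extract("signed.png"))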

👤 hiccuphippo
If they are mistaken for real photography then mission fucking accomplished. https://xkcd.com/810/

Jesting aside, if the resulting image is good enough to publish and tag, then it's good enough to put back into a training set.

I also assume there could be a market for AI trained only on "human produced art" the same way there's a market for organic vegetables.


👤 enviclash
Yes, in the same way that algorithms influence human decisions.

👤 seydor
There will no longer be "training data" in the sense of large scraped datasets. People will fine-tune their models by adding specific new subsets of images.

👤 hmate9
IMO no. Just use existing image sets to train future models if this becomes too big of a problem.

👤 emporas
I think that all future deep learning will be trained on fresh data, crowdsourced from billions of humans. Humans will be expected to photograph, say, a door knob from a distance of 20 cm, at a height of a meter from the floor, etc.

If a billion humans each take 50 photos like that, spending fifteen minutes of their lives to do so, we will have almost as much data as the LAION database, but for door knobs. The photo workers would be paid something like 0.00001 dollars per picture by the users of the deep learning algorithms.

The payment method would be blockchain and bitcoin, if you have heard of such a thing. Bitcoin, the money of information, will enable a marketplace of information in which the better the information, the more the producer is paid. Bitcoin SV can support almost a million transactions per second as of today, and the TPS is increasing tenfold every year.