HACKER Q&A
📣 lvnfg

Should AI trained on publicly available data be banned?


The creators of DALL-E and GitHub Copilot -- and people in favor of training AIs on publicly available data in general -- argue that people have been studying and making derivative work since the dawn of history, so there should be no problem training these models on public art and OSS software. There are many IP and copyright laws and precedent cases addressing whether a derivative work is allowed, and the images and code snippets these AIs generate seem to satisfy all of them. They are trained on publicly available sources, are not identical copies, are not clearly attributable to any original author (unless explicitly prompted to be), and the AI model is incapable of having a clear intention of copying when the work is being generated. These are all valid points, and certainly the lawyers at Microsoft and OpenAI can argue in favor of them better than I can.

The problem is that these laws were created when creating derivative works was not scalable.

Just a year ago, to create a painting you had to spend years learning how to paint, studying the works of the old masters, making and improving your own works. Each new painting takes less time than the last, but the hours dedicated to creating a new work, whether a masterpiece or a book cover, are never insignificant. Creating a variation takes almost as much time. To change to a new style, you have to start the learning process all over again. It is hard, but you put your personal time and effort into it. And if your work was inspired by others, as long as it is not a blatant copy, the original authors will certainly feel appreciated, even if they may not like it.

Now all it takes is a text prompt and a few clicks, and you can create images in any style, from any author, with as many variations as you want. Code snippets generated by Copilot are not currently in the same league, but I think we can all agree that it is only a matter of time before a whole project can be created from a requirements prompt, quite possibly in the same style as the authors of the individual packages the AI was trained on.

The difference is industrial-scale harvesting of creative effort. Somebody admiring your works and spending time and effort trying to learn from you is okay. If they become famous and make a lot of money, you will be credited as their inspiration and at least an influence on their success. An AI blindly scanning, blending, and mass-producing altered works to make a corporation rich is not; in the corporation's eyes, the original authors, from the old masters to modern amateurs, are no different from unpaid workers in a textile mill somewhere, their output fed into a machine to create mass-produced clothing.

Sure, all the training data is taken from publicly available sources with permissive licenses, but those licenses were created when we still thought the creative arts would be the last AI frontier, which wasn't even that long ago. If you could have foreseen that the license you chose back then would allow a corporation to profit from your work, would you have made the same choice?

So I would argue that from now on, training AIs on publicly available data must be banned, except where the work is in the public domain or where the license explicitly allows the work to be used as training data for AI models. If such a law is not passed, current permissive licenses such as Creative Commons should be revised to include a clause addressing this point, letting the author decide whether or not to let AIs train on their work.

I feel like having such a law or licensing convention established as soon as possible would greatly benefit humanity in the long term. In the short term we all want more free content, but once AIs start taking over the jobs now performed by humans, the creative arts may be the only source of meaning left for us.


  👤 vivegi Accepted Answer ✓
The solution lies in denying authorship rights (i.e., natural rights of authorship) to programs. This would prevent claiming copyright on AI-generated content.

A fundamental issue with granting authorship rights over AI/program-generated content to the AI/programs is this: what happens when two people use the same AI with similar prompts and generate two different works that also look similar? Is each infringing on the other? This issue is unsettled.

The case law on this is evolving in various jurisdictions.

Case law from India:

Aug 2021: India recognises AI as co-author of copyrighted artwork https://www.managingip.com/article/2a5czmpwixyj23wyqct1c/exc...

Dec 2021: Indian Copyright Office issues withdrawal notice to AI co-author https://www.managingip.com/article/2a5d0jj2zjo7fajsjwwlc/exc...


👤 sigmaprimus
Regarding your last paragraph, it is possible that a new law or licensing convention would benefit some, but I cannot think of a currently existing law or licensing convention that has greatly benefited humanity. In all the examples I can come up with, bureaucrats seem to gain the most benefit by being the ones enforcing said regulations.

As far as AI taking over jobs, that argument has been made many times before with each new technology. Think of the farriers and buggy-whip manufacturers displaced by the automobile, or the typing pools that vanished with fax machines and later email.

More specific to your concerns: in the past, the printing press, photocopiers, cameras, and many other inventions were seen as threats to artists, but the creative ones found a way to work with the new technology and did just fine.

I think your heart is in the right place, maybe leaning a bit further to the left than mine, but in any case I wouldn't be too worried about the creative arts. When art becomes common and its value is diminished, creative people find a new way!


👤 muzani
The first side effect that comes to mind is Google. It scrapes public data and uses some AI in search.

But I think you might be on to something with derivative work, perhaps redefining "derivative" to include AI-generated content. Shakespeare or the Mona Lisa would be fine. It might allow training AI search-engine spiders, but not training AI content generation.

What if someone writes Harry Potter fan fiction but puts it under an open license? The AI might not be trained on Harry Potter itself, but it could be trained on millions of works of semi-legal fan fiction. That's frequently the source of "X in the style of Y" prompts.


👤 tarunmuvvala
I think banning is not a great option. Banning will curb the creativity of thinkers. Instead, we need to provide a mechanism that compensates contributors monetarily and gives them a form of recognition too.

We need to find a way to extend benefits to all data contributors. I find the Gitcoin way of developing a product a great way to engage a community and reward folks with an equity-linked token. Something similar should happen with data providers too.

Foundations, NGOs, or public dataset orgs could then sell the token on the market to people who want to use the AI and benefit from it.


👤 onion2k
I think this would have some significant unintended side effects. For example, is there any real difference between creating an index of millions of pictures and using ML to combine them into a new picture, and creating an index of millions of websites and using ML to combine them into a search results page?

Without being very careful about how you regulate the use of "public data," you could end up accidentally killing the internet.