HACKER Q&A
📣 IYasha

Is there a license to protect my code and art from AI?


I'm looking for a way to protect my GPL/MPL/other code and CC game artwork against ML that would ingest it to produce other work. The only exception is when someone uses maps to train AI-driven bots; that's fine. So I'm looking for some standardized way to state that my work is for human use only. Ideally this would also work in an automated way, like prefixing headers with something like /* */.
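To make the automated part concrete, here is a rough sketch of the kind of header prefixing I have in mind. The notice text and the `add_notice` helper are made up for illustration; as far as I know there is no standard or legally recognized marker, which is exactly what I'm asking about.

```python
# Hypothetical marker text: an assumption for illustration only,
# not an existing standard and not known to have any legal effect.
NOTICE = "/* NOML: for human use only, not for ML training */"


def add_notice(source: str, notice: str = NOTICE) -> str:
    """Return `source` with `notice` as its first line.

    The function is idempotent: if the text already starts with the
    notice, it is returned unchanged.
    """
    if source.startswith(notice):
        return source
    return notice + "\n" + source


if __name__ == "__main__":
    # Demo on an in-memory C snippet; a real script would loop over files.
    print(add_notice("int main(void) { return 0; }\n"))
```

This only handles languages with C-style block comments and ignores things like shebang lines, so treat it as a starting point.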


  👤 bigB Accepted Answer ✓
My rather unpopular view is that if it's available for humans to read on the internet, a machine reading it is no different, other than in scale. There is nothing stopping a human from reading your code and using it themselves, or at the very least adapting it for themselves. Unless the result is a copy or contains actual code you have written, I'm not sure you can actually defend it, other than by having it locked down on the internet.

Of course, none of this has been tested in court, and my feeling is that once it is, we will end up with a system similar to robots.txt, or maybe even with the ML companies forced to attribute work, which would be a major pain for them.

Personally, I'm not sure where I stand on what is right or wrong, as I can see both arguments. For example, artists regularly take inspiration from closely studying other artists, producing works that are often similar or in the same style. Yet this is seen as perfectly OK in the art world, as long as it's not a copy; much of the time they don't even need to say who inspired them. How is this different from AI doing the same thing, other than it not being human?

On the other hand, when someone sees a company making money from something that looks inspired by their art, with no credit, I can also understand how they might feel. It's going to be very interesting to see how this plays out, and at this point either side could win.

👤 chii
The right to restrict learning from the literary work is not a right that is granted under copyright.

You can make your code something that cannot be _reproduced_ without a license, but you cannot stop someone from learning ideas from your published code. Learning does not require a license; only distributing or making derivative works does.

Whether that learning happens via ML or a real flesh-and-blood human brain makes no difference, I think. You would need to lobby for an update to copyright law to have a new right granted.


👤 formerly_proven
The legal theory used by ML companies that allows them to use any and all data a crawler can reach on the internet rests on the idea that training ML models with data is always fair use. ML companies thus argue they are entitled to ignore such directives, as those would be based around copyright licensing, and according to their own legal theory that's unnecessary for them.

And just to be clear, a theory is all this is. It has never been tested in court.


👤 iinnPP
I see this issue as the same issue seen by the MPAA when looking at piracy.

The tech is there.

It won't ever leave.

If you cut off one head, three will replace it.

In my opinion: Copyright is a failure and the absolute best move, with the advantage AI has given to everyone, is to abolish copyright.


👤 visarga
The tag is a nice idea, maybe #noml on social networks. But its efficacy depends on the goodwill of those who create the datasets. Another approach could be a NOML registry for art and code.

But practically, what you need to do is stop crawlers from reading your code/art with robots.txt or a CAPTCHA. If your works don't get into CommonCrawl and similar datasets, they won't be used for model training. I think you can still allow Googlebot while rejecting AI data collectors on your sites.
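A minimal sketch of that last point, assuming Common Crawl's crawler identifies itself as CCBot and that the crawler honors robots.txt at all, which is voluntary. The example.com URL and the `is_allowed` helper are made up for illustration; the check itself uses Python's standard `urllib.robotparser`, so you can test a policy before deploying it.

```python
from urllib.robotparser import RobotFileParser

# Example policy: block Common Crawl's bot, allow every other crawler.
# Other AI data collectors would each need their own User-agent group.
ROBOTS_TXT = """\
User-agent: CCBot
Disallow: /

User-agent: *
Allow: /
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if `user_agent` may fetch `url` under `robots_txt`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    url = "https://example.com/art/sprite.png"  # placeholder URL
    for agent in ("CCBot", "Googlebot"):
        print(agent, is_allowed(ROBOTS_TXT, agent, url))
```

This only tells you what a well-behaved crawler would do with the file; it does nothing against a collector that ignores robots.txt.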

I think copyright refers only to expression, not the ideas themselves. So training on copyrighted code and art should be ok as long as expression is not copied.

In cases where an AI developer wants to train on the ideas without learning the expression, they can regenerate the data using the "variations" method, which works for both images and text. This creates substitute data, much like anonymising PII.


👤 jraph
I also want my work protected from being used to train models that will produce code where my attribution and redistribution conditions are not respected. In particular, I don't want my work to help build proprietary software. Or I want to at least be paid for this. I explicitly use the (A)GPL so that people need to make their work available to their users, with the relevant rights to adapt and redistribute, if they want to use mine.

However, I'd rather not see a separate license / mechanism for this, because now we would have people who'd be fine with their work being used this way, people like us who are not fine with this and people who don't know / care. And mixing code from people of these different groups, which the licenses you cited allow, is going to be a mess.

I would also like this to be opt-in, not opt-out.

Eventually, we need the legal system to do its work quickly and tell us whether training ML models qualifies as fair use, and under which conditions, so we can build a strong defense.


👤 presspot
The stated purpose of copyright law is "to promote the Progress of Science and useful Arts."

Congress decided to grant limited "rights" to copyright holders as incentives for them to create, not to protect their work forever, but in exchange for it being widely available and eventually in the public domain.

That these incentives take the form of limited protections is a side effect. That the original purpose is often corrupted and delivered ham-handedly has more to do with politics than purpose.

If we argued from first principles, we could invent a better system, but that system would undoubtedly allow for ingestion and manipulation by AI.


👤 im3w1l
If you put code online where it is publicly readable, then there is a significant probability that training ML on it is allowed. So the first step you need to take is to not do that. You must ensure that anyone who reads your code has first signed an agreement not to train ML on it.

If, like the GPL, you want to allow people to share your code further and make derivative works, then you must ensure that those people lock such works behind the same restrictions, so that derivative works can also only be read after signing the license.


👤 qikInNdOutReply
You can obfuscate your codebase, forcing the ML to generate endless amounts of intangible boilerplate.

You can add malevolent code to your code base, which at best lets the ML project gain self-awareness (Copilot "SHODAN") and at worst just makes it add malware. Of course, you then need to remember to remove the evil code before compilation.

If you do art, I think the best thing to throw off AI would be fractal details, i.e. your picture never ends when you zoom in, but just becomes more art. That, or you try to throw off the weights in another way.


👤 vfistri2
Would be interested in art copyright too. But honestly, I have no idea how to prove that an ML model was trained on my art.

👤 lamuswawir
Thought of the same thing with voice. Can there be any limits on voice reproduction by AI?

👤 SinisterAlex
On a side note: is it only GitHub that openly states that AI is trained on the code you publish there? What about GitLab, GitBucket, etc.?

👤 foolrush
Resisting the structure requires an epistemic shift; resist applying a license.

👤 MikusR
Define human.