HACKER Q&A
📣 caprock

Should the greater good ever override copyright for large language models?


A hypothetical. A thought experiment.

Imagine a scenario not too different from what's happening today. An entity of some sort is able to train very large-scale models on language and code. The inputs are all our code on GitHub and elsewhere, as well as all our writing on the internet. The difference from today is that these models would be released to the public, free under every definition of the word. Maybe there are even advances that allow these models to run efficiently on consumer-grade hardware. No commercial or governmental entity can reserve the models for private gain.

There are currently lots of concerns around the copyright and licensing of the input data for training these models. Should we consider dropping copyright/licensing claims so that society can benefit from whatever greater good these models might bring about? Would you be willing to do so as an individual? Should societies do so as a whole?


  👤 IceMetalPunk Accepted Answer ✓
I think the idea of copyright being applicable to training data in the first place is absurd. The point of copyright isn't supposed to be "no one can ever use my things for anything", it's "no one can take my work and claim they made it, or give people access to it without asking me first". When data is given to a neural network for training, neither of those things happens. Training an AI on someone's work doesn't reproduce that work, especially when the training set contains tons of other people's work. And if your work is accessible enough to become training data in the first place, then at the very least you have given access to it to the person using it as training data. So trying to claim that copyright prevents your work from being used as AI training data is a total perversion of copyright's intent.

Copyright was always meant to be a protection against credit theft and others profiting by reproducing your work without your permission. It has largely become a capitalistic troll playground where "you've used my work anywhere" and "this seems kind of like my work" are treated as grounds for legal action. And that's messed up.