HACKER Q&A
📣 krm01

What's the value ratio of GPT-3's LLM vs. its training data?


I'm curious what the real value of GPT-3 is. Do they have a unique LLM, or is it the amount of data it's trained on that makes it valuable?

I.e., could anyone grab a similar LLM and just train it on more data to make it more valuable, or is a significant amount of the value created by the LLM itself?


  👤 sbierwagen Accepted Answer ✓
If the number of bits in the net is smaller than the number of bits in the training data, then training is always going to be a "lossy" operation, losing information in the strictest sense. So from that perspective, the source text is "more" "valuable".
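
For a rough sense of scale, here's a back-of-envelope sketch using the parameter count and training-token count reported in the GPT-3 paper (Brown et al., 2020). The fp16 weights and the bytes-per-token figure are my own loose assumptions:

    # Back-of-envelope comparison of model size vs. training data size,
    # using figures reported in the GPT-3 paper (Brown et al., 2020).
    # Rough numbers only; the point is the direction of the inequality.

    params = 175e9                 # GPT-3 parameter count (from the paper)
    bits_per_param = 16            # assuming fp16 weights
    model_bits = params * bits_per_param

    tokens = 300e9                 # training tokens (from the paper)
    bits_per_token = 8 * 4         # ~4 bytes of text per token, a loose guess
    data_bits = tokens * bits_per_token

    print(f"model: {model_bits / 8 / 1e9:,.0f} GB")   # ~350 GB
    print(f"data:  {data_bits / 8 / 1e9:,.0f} GB")    # ~1,200 GB
    print(f"ratio: {data_bits / model_bits:.1f}x more bits in the data")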

Training is a fairly mechanical operation-- you lay out the architecture, decide on an encoding/weighting scheme, then feed the training data through the net. Not a lot of secret sauce there. (As evidenced by the explosion of competing LLMs after the GPT papers were published.)
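
Schematically, that whole loop fits in a few dozen lines. Here's a toy sketch in PyTorch; the architecture, sizes, and random "data" are stand-ins for illustration, not anything from GPT-3:

    import torch
    import torch.nn as nn

    # Pick an architecture, pick an encoding, feed batches through,
    # descend the gradient. That's the "mechanical" part.
    vocab_size, d_model, seq_len, batch_size = 1000, 64, 32, 8

    class TinyLM(nn.Module):
        def __init__(self):
            super().__init__()
            self.embed = nn.Embedding(vocab_size, d_model)  # ids -> vectors
            layer = nn.TransformerEncoderLayer(d_model, nhead=4,
                                               batch_first=True)
            self.blocks = nn.TransformerEncoder(layer, num_layers=2)
            self.head = nn.Linear(d_model, vocab_size)      # back to vocab

        def forward(self, x):
            # Causal mask: position i may only attend to positions <= i.
            t = x.size(1)
            mask = torch.triu(torch.ones(t, t, dtype=torch.bool), diagonal=1)
            return self.head(self.blocks(self.embed(x), mask=mask))

    model = TinyLM()
    optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4)
    loss_fn = nn.CrossEntropyLoss()

    for step in range(100):
        batch = torch.randint(0, vocab_size, (batch_size, seq_len))  # fake ids
        inputs, targets = batch[:, :-1], batch[:, 1:]   # predict next token
        logits = model(inputs)
        loss = loss_fn(logits.reshape(-1, vocab_size), targets.reshape(-1))
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()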

Going forward, the copyright status of training data and the possible need to pay licensing fees when training a neural net on (say) a book will be a hot litigation topic. Before 2019 this just was not a thing at all. ("Fair use" covered anything done to a book in your possession. After a library buys a book, do they pay an additional fee every time a patron opens it?)

Read "Training Compute-Optimal Large Language Models" paper for some mediations on the tradeoff between model size and training data volume: https://arxiv.org/pdf/2203.15556.pdf