I.e., can anyone grab a similar LLM and just train it on more data to make it more valuable, or is a significant amount of the value created in the LLM itself?
Training is a fairly mechanical operation: you lay out the architecture, decide on an encoding/weighting scheme, then feed the training data through the net. Not a lot of secret sauce there. (As evidenced by the explosion of competing LLMs after the GPT papers were published.)
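To make the "mechanical" part concrete, here is a toy sketch of those three steps in PyTorch (architecture, encoding, data through the net). Everything here is illustrative; the model, data, and hyperparameters are nothing like what a production LLM uses.

    # A toy version of the three steps: lay out an architecture, pick an
    # encoding, feed the training data through the net. Illustrative only.
    import torch
    import torch.nn as nn

    text = "the quick brown fox jumps over the lazy dog " * 200
    vocab = sorted(set(text))
    stoi = {ch: i for i, ch in enumerate(vocab)}        # encoding scheme
    data = torch.tensor([stoi[ch] for ch in text])

    class TinyLM(nn.Module):                            # the architecture
        def __init__(self, vocab_size, dim=64):
            super().__init__()
            self.embed = nn.Embedding(vocab_size, dim)
            self.rnn = nn.GRU(dim, dim, batch_first=True)
            self.head = nn.Linear(dim, vocab_size)

        def forward(self, x):
            h, _ = self.rnn(self.embed(x))
            return self.head(h)

    model = TinyLM(len(vocab))
    opt = torch.optim.AdamW(model.parameters(), lr=3e-3)
    loss_fn = nn.CrossEntropyLoss()

    block, batch = 32, 16
    for step in range(200):                             # feed data through
        starts = torch.randint(0, len(data) - block - 1, (batch,)).tolist()
        x = torch.stack([data[s:s + block] for s in starts])
        y = torch.stack([data[s + 1:s + block + 1] for s in starts])
        loss = loss_fn(model(x).reshape(-1, len(vocab)), y.reshape(-1))
        opt.zero_grad()
        loss.backward()
        opt.step()

Scaling that loop up to billions of parameters is mostly an engineering and data problem rather than a conceptual one, which is the point.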
Going forward, the copyright status of training data and the possible need to pay licensing fees when training a neural net on (say) a book will be a hot litigation topic. Before 2019 this just was not a thing at all. ("Fair use" covered anything done to a book in your possession. After a library buys a book, does it pay an additional fee every time a patron opens it?)
Read "Training Compute-Optimal Large Language Models" paper for some mediations on the tradeoff between model size and training data volume: https://arxiv.org/pdf/2203.15556.pdf