Once a model has been trained, the totality of its knowledge is presumably encoded in its weights, architecture, hyperparameters, and so on, and the size of all of that is presumably measurable in bits. Accepting that the total "useful information" encoded may come with caveats about how to effectively query the model, in principle it seems like we can measure the amount of useful information that's encoded in, and retrievable from, the model.
I do sense a challenge in equating the "raw" and "useful" forms of information in this context. An English, text-only Wikipedia article about "Shiitake Mushrooms" may be 30 KB, but we could imagine that not all of it needs to be encoded in an LLM that accurately captures the "useful information" about shiitake mushrooms. The LLM might be able to reproduce all the facts about shiitakes that the article contained, yet not be able to reproduce the article itself. So in some ontologically sensitive way, the LLM performs a lossy transformation during the learning and encoding process.
I'm wondering what we know about the data-storage characteristics of the useful information encoded by a given model. Is there a way to measure or estimate the amount of useful information encoded by an LLM? If some LLM is trained on Wikipedia, what is the relationship between the amount of useful information it can reliably reproduce and the size of the model relative to the source material?
In the case where the model is substantially larger than the source, can I feel metaphorically justified in likening the model to both "tables and indices"? If the model is smaller than the source, can I feel justified in wrapping the whole operation in a "this is fancy compression" metaphor?
To convert a generative model into a compression algorithm, you just use arithmetic coding: https://en.wikipedia.org/wiki/Arithmetic_coding.
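A minimal sketch of that direction, assuming a toy stand-in for the model's next-token distribution (a real LLM would condition on the context); it uses exact fractions to sidestep the renormalization and bit-output machinery a production arithmetic coder needs:

```python
# Minimal sketch: arithmetic coding driven by a (toy) predictive model.
from fractions import Fraction

ALPHABET = ["a", "b", "<eos>"]

def toy_model(context):
    # Stand-in for an LLM's next-token distribution; a real model
    # would condition on `context`, this one is fixed.
    return {"a": Fraction(5, 10), "b": Fraction(4, 10), "<eos>": Fraction(1, 10)}

def encode(symbols):
    # Narrow [low, high) by each symbol's slice of the model's distribution.
    low, high, context = Fraction(0), Fraction(1), []
    for sym in symbols:
        probs, width, cum = toy_model(context), high - low, Fraction(0)
        for s in ALPHABET:
            if s == sym:
                low, high = low + width * cum, low + width * (cum + probs[s])
                break
            cum += probs[s]
        context.append(sym)
    # Any number in [low, high) identifies the sequence; writing it out
    # takes about -log2(high - low) bits, i.e. the model's surprisal.
    return low, high

def decode(value, n_symbols):
    # Replay the same model and pick whichever slice contains `value`.
    out, context = [], []
    low, high = Fraction(0), Fraction(1)
    for _ in range(n_symbols):
        probs, width, cum = toy_model(context), high - low, Fraction(0)
        for s in ALPHABET:
            upper = low + width * (cum + probs[s])
            if value < upper:
                out.append(s)
                low, high = low + width * cum, upper
                break
            cum += probs[s]
        context.append(out[-1])
    return out

lo, hi = encode(["a", "b", "a", "<eos>"])
print(decode(lo, 4))  # ['a', 'b', 'a', '<eos>']
```

The better the model predicts the data, the wider the final interval and the fewer bits it takes to name a point inside it.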
To convert a compression algorithm into a generative model, you assign a probability to each piece of data according to the size of its compressed representation.
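A sketch of that reverse mapping, using zlib as a stand-in compressor: an ideal code of length L bits corresponds to probability 2^-L, so shorter compressed output means a "more probable" string (up to normalization over all strings).

```python
import zlib

def log2_prob(data: bytes) -> float:
    # Compressed length in bits, read as a code length -> log2 probability.
    compressed_bits = 8 * len(zlib.compress(data))
    return -compressed_bits  # log2 of the (unnormalized) probability

# zlib "believes" repetitive text far more than an arbitrary byte sequence:
print(log2_prob(b"the cat sat on the mat " * 10))
print(log2_prob(bytes(range(230))))
```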
See also the Hutter Prize and associated FAQ: http://prize.hutter1.net/
If you wanted to specifically measure the "useful" information, you would need some way of sampling from the set of possible articles that contain the same "useful" information but vary in the "useless" information, and vice versa. I think you would find it difficult to define where the boundary is, but if you made some arbitrary choice, you could measure what you are looking for through the LLM's probabilities.
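As a sketch of that measurement step: the bits a model assigns to a text are its negative log2-probability, which you could compare across variants that (arguably) share the same "useful" content. GPT-2 via the Hugging Face transformers library is just a stand-in model here, and the two sentences are invented paraphrases.

```python
import math
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

def bits_for(text: str) -> float:
    ids = tokenizer(text, return_tensors="pt").input_ids
    with torch.no_grad():
        # With labels=ids the model returns mean cross-entropy in nats/token.
        loss = model(ids, labels=ids).loss
    n_predicted = ids.shape[1] - 1  # the first token isn't predicted
    return loss.item() * n_predicted / math.log(2)

print(bits_for("Shiitake mushrooms are edible fungi native to East Asia."))
print(bits_for("Edible fungi native to East Asia include the shiitake mushroom."))
```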
https://en.wikipedia.org/wiki/Hutter_Prize
GPT-3 is said to have 175 billion parameters; if those are float32s (I bet they could get away with less than that), that would be 700 GB of data. Wikipedia also says that "60% of the weighted pre-training dataset for GPT-3 comes from a filtered version of Common Crawl consisting of 410 billion byte-pair-encoded tokens".
If 410 billion tokens is 60% of the mix, the whole dataset is about 680B tokens. Say the average token is 5 characters; that is about 3,400B characters of text, so the output is "compressed" to roughly 20% of the input, which state-of-the-art text compressors can accomplish.
Now my figures could be off: they might be storing the parameters more efficiently, and the average token could be longer. But it seems to make sense that if you trained a model to capture as much information as possible from the text, it would be about that size. Given that this kind of model seems to be able to spit out what it was trained on (though sometimes garbled), that might be about right.
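For what it's worth, the same back-of-envelope arithmetic, with the same guesses as above (float32 weights, ~5 characters per token, 1 byte per character):

```python
params = 175e9
model_bytes = params * 4                     # float32 -> ~700 GB of weights
total_tokens = 410e9 / 0.60                  # 410B tokens is 60% of the mix
chars = total_tokens * 5                     # ~3.4 trillion characters
print(f"model:  {model_bytes / 1e9:.0f} GB")
print(f"corpus: {chars / 1e9:.0f} GB")
print(f"ratio:  {model_bytes / chars:.0%}")  # ~20%
```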