p(x4,x3,x2,x1) = p(x4|x3,x2,x1) * p(x3|x2,x1) * p(x2|x1) * p(x1), where x1 denotes the 1st token, x2 denotes the 2nd token, and so on.
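Written out for a sequence of arbitrary length n, I believe this is just the chain rule of probability:

$$
p(x_1, x_2, \dots, x_n) = p(x_1) \prod_{t=2}^{n} p(x_t \mid x_1, \dots, x_{t-1})
$$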
I understand the conditional terms p(x_n|...): their losses are computed with cross-entropy during training. However, I'm unsure about the probability of the very first token, p(x1). How is it calculated? Is it handled somewhere in the training configuration, in the model architecture, or in the loss function?
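To make the question concrete, here is a minimal sketch of the next-token cross-entropy loss as I understand it (PyTorch; the vocabulary size, token ids, and random logits are just placeholders for a real model's output):

```python
import torch
import torch.nn.functional as F

# toy vocabulary size and a tiny batch of token ids (hypothetical values)
vocab_size = 8
tokens = torch.tensor([[3, 1, 4, 5]])  # [x1, x2, x3, x4]

# inputs are x1..x3, targets are x2..x4 (shifted by one position)
inputs, targets = tokens[:, :-1], tokens[:, 1:]

# pretend `model(inputs)` returns logits of shape (batch, seq_len, vocab_size);
# here random numbers stand in, just to show the loss computation
logits = torch.randn(inputs.shape[0], inputs.shape[1], vocab_size)

# cross-entropy covers p(x2|x1), p(x3|x1,x2), p(x4|x1,x2,x3) --
# as far as I can tell, nothing here corresponds to p(x1) itself
loss = F.cross_entropy(logits.reshape(-1, vocab_size), targets.reshape(-1))
print(loss)
```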
IMHO, if the model doesn't learn p(x1) properly, the chain-rule factorization above cannot be completed, and we can't call LLMs "truly generative". Am I missing something here?
I asked the same question on the nanoGPT repo (https://github.com/karpathy/nanoGPT/issues/432), but I haven't found the answer I'm looking for yet. Could someone please enlighten me? Thanks in advance!