I've read the relevant papers and gone through many explanations of how transformers work.
Often those explanations spend thousands of words explaining attention at the word level, and then just say a few words like "oh, and with multiple attention heads it focuses on different aspects, and then multiple layers, and then, magic!".
What's happening with those other aspects, and what are they? Are there papers that examine what kinds of concepts the model is actually building/learning in those heads and layers?
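(For concreteness, the heads themselves just produce per-token attention weights that you can dump and stare at; here's a rough sketch of doing that with GPT-2, assuming the Hugging Face transformers library, not taken from any particular paper:)

    # Rough sketch: inspect what individual attention heads attend to in GPT-2.
    # Assumes Hugging Face transformers; layer/head choice is arbitrary.
    import torch
    from transformers import GPT2Tokenizer, GPT2Model

    tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
    model = GPT2Model.from_pretrained("gpt2").eval()

    inputs = tokenizer("The chair wore a sundress to the party", return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs, output_attentions=True)

    # outputs.attentions is a tuple with one tensor per layer,
    # each of shape (batch, num_heads, seq_len, seq_len).
    tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0].tolist())
    layer, head = 5, 3  # pick any layer/head to inspect
    attn = outputs.attentions[layer][0, head]  # (seq_len, seq_len)
    for i, tok in enumerate(tokens):
        top = attn[i].argmax().item()
        print(f"{tok!r:>12} attends most to {tokens[top]!r}")

Different heads pick out noticeably different patterns (previous token, repeated names, syntax-ish links), which seems to be the "different aspects" part that explanations usually hand-wave.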
There are large teams who spend months tuning those models. Do those teams have access to those internal concepts that the model built up and organized? Is any of this work public?
In computer vision and CNNs, I recall seeing a paper once that showed that each layer of the network was learning a higher-level feature than the layer before it (as an inaccurate example: the first layer learns edges, the second layer learns shapes, the third layer textures, the fourth layer objects, etc., and they showed the eigenvectors of each as representatives).
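(For what it's worth, you can reproduce the spirit of that yourself by pulling intermediate activations out of a pretrained CNN; a rough sketch assuming PyTorch and torchvision's ResNet-18, not the exact setup from that paper:)

    # Rough sketch: grab intermediate activations from a pretrained CNN, one hook
    # per stage, to eyeball how "abstract" each layer's features are.
    # Assumes PyTorch + torchvision; not the setup from the paper I'm recalling.
    import torch
    from torchvision import models

    model = models.resnet18(weights=models.ResNet18_Weights.DEFAULT).eval()

    activations = {}
    def save_activation(name):
        def hook(module, inputs, output):
            activations[name] = output.detach()
        return hook

    # Register hooks on the four residual stages (roughly "low" to "high" level).
    for name in ["layer1", "layer2", "layer3", "layer4"]:
        getattr(model, name).register_forward_hook(save_activation(name))

    x = torch.randn(1, 3, 224, 224)  # stand-in for a real image
    with torch.no_grad():
        model(x)

    for name, act in activations.items():
        print(name, tuple(act.shape))  # channels go up, spatial resolution goes down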
E.g., I asked ChatGPT to tell me a joke about a table in a sundress in the voice of a famous stoic person. Judging by its response, it adequately "understands" what that person's style sounds like, basic humor, the concept of clothing and how to map it onto an inanimate object (punchline: "I figured if a chair can wear a seat cushion, why can't I wear a sundress?"), and so on.
(Obviously this is a tame example, but serves its purpose for the discussion).
> There are large teams who spend months tuning those models. Do those teams have access to those internal concepts that the model built up and organized? Is any of this work public?
See: https://openai.com/research/language-models-can-explain-neur...
My understanding: generally, the models are compressing all the text they were fed during pre-training, and in doing so they learn higher-order concepts that make that compression better: more compressed, with less loss.
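To make the compression framing concrete: the model's average next-token cross-entropy is literally the number of bits it needs per token, i.e. how well it would compress the text. A quick sketch, assuming GPT-2 via the Hugging Face transformers library (my example, not from the linked post):

    # Quick sketch of the compression view: cross-entropy loss in nats per token
    # converts directly to bits per token, i.e. roughly how small the model could
    # losslessly encode this text with arithmetic coding. Assumes GPT-2 via HF.
    import math
    import torch
    from transformers import GPT2Tokenizer, GPT2LMHeadModel

    tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
    model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

    text = "I figured if a chair can wear a seat cushion, why can't I wear a sundress?"
    inputs = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        loss = model(**inputs, labels=inputs["input_ids"]).loss  # nats per token

    bits_per_token = loss.item() / math.log(2)
    n_tokens = inputs["input_ids"].shape[1]
    raw_bits = len(text.encode("utf-8")) * 8
    print(f"{bits_per_token:.2f} bits/token, "
          f"~{bits_per_token * n_tokens:.0f} bits total vs {raw_bits} bits raw")

The better the model's "understanding", the lower that bits-per-token number gets, which is the sense in which learning concepts and compressing are the same thing.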