Makes me want to better understand this tech.
Edit: thank you for some amazing top level responses and links to valuable content on this subject.
First, we take a sequence of words and represent it as a grid of numbers: each column of the grid is a separate word, and each row of the grid is a measurement of some property of that word. Words with similar meanings are likely to have similar numerical values on a row-by-row basis.
(During the training process, we create a dictionary of all possible words, with a column of numbers for each of those words. More on this later!)
This grid is called the "context". Typical systems will have a context that spans several thousand columns and several thousand rows. Right now, context length (column count) is rapidly expanding (1k to 2k to 8k to 32k to 100k+!!) while the dimensionality of each word in the dictionary (row count) is pretty static at around 4k to 8k...
Anyhow, the Transformer architecture takes that grid and passes it through a multi-layer transformation algorithm. The functionality of each layer is identical: receive the grid of numbers as input, then perform a mathematical transformation on the grid of numbers, and pass it along to the next layer.
Most systems these days have around 64 or 96 layers.
After the grid of numbers has passed through all the layers, we can use it to generate a new column of numbers that predicts the properties of some word that would maximize the coherence of the sequence if we add it to the end of the grid. We take that new column of numbers and comb through our dictionary to find the actual word that most-closely matches the properties we're looking for.
That word is the winner! We add it to the sequence as a new column, remove the first column, and run the whole process again! That's how we generate long text-completions one word at a time :D
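Here is a rough sketch of that loop in Python; `embed`, `transformer_layers`, and `nearest_word` are hypothetical stand-ins for the real model pieces, not any specific library's API:

    import numpy as np

    def generate(words, n_new, embed, transformer_layers, nearest_word, context_len=2048):
        words = list(words)
        for _ in range(n_new):
            context = words[-context_len:]                        # keep only the most recent columns
            grid = np.stack([embed(w) for w in context], axis=1)  # rows = properties, columns = words
            for layer in transformer_layers:                      # same kind of transformation at every layer
                grid = layer(grid)
            next_vec = grid[:, -1]                                # predicted properties of the next word
            words.append(nearest_word(next_vec))                  # dictionary lookup for the closest match
        return words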
So the interesting bits are located within that stack of layers. This is why it's called "deep learning".
The mathematical transformation in each layer is called "self-attention", and it involves a lot of matrix multiplications and dot-product calculations with a learned set of "Query, Key and Value" matrixes.
It can be hard to understand what these layers are doing linguistically, but we can use image-processing and computer-vision as a good metaphor, since images are also grids of numbers, and we've all seen how photo-filters can transform that entire grid in lots of useful ways...
You can think of each layer in the transformer as being like a "mask" or "filter" that selects various interesting features from the grid, and then tweaks the image with respect to those masks and filters.
In image processing, you might apply a color-channel mask (chroma key) to select all the green pixels in the background, so that you can erase the background and replace it with other footage. Or you might apply a "gaussian blur" that mixes each pixel with its nearest neighbors, to create a blurring effect. Or you might do the inverse of a gaussian blur, to create a "sharpening" operation that helps you find edges...
But the basic idea is that you have a library of operations that you can apply to a grid of pixels, in order to transform the image (or part of the image) for a desired effect. And you can stack these transforms to create arbitrarily-complex effects.
The same thing is true in a linguistic transformer, where a text sequence is modeled as a matrix.
The language-model has a library of "Query, Key and Value" matrixes (which were learned during training) that are roughly analogous to the "Masks and Filters" we use on images.
Each layer in the Transformer architecture attempts to identify some features of the incoming linguistic data, and then, having identified those features, it can subtract them from the matrix, so that the next layer sees only the transformation, rather than the original.
We don't know exactly what each of these layers is doing in a linguistic model, but we can imagine it's probably doing things like: performing part-of-speech identification (in this context, is the word "ring" a noun or a verb?), reference resolution (who does the word "he" refer to in this sentence?), etc, etc.
And the "dot-product" calculations in each attention layer are there to make each word "entangled" with its neighbors, so that we can discover all the ways that each word is connected to all the other words in its context.
So... that's how we generate word-predictions (aka "inference") at runtime!
But why does it work?
To understand why it's so effective, you have to understand a bit about the training process.
The flow of data during inference always flows in the same direction. It's called a "feed-forward" network.
But during training, there's another step called "back-propagation".
For each document in our training corpus, we go through all the steps I described above, passing each word into our feed-forward neural network and making word-predictions. We start out with a completely randomized set of QKV matrixes, so the results are often really bad!
During training, when we make a prediction, we KNOW what word is supposed to come next. And we have a numerical representation of each word (4096 numbers in a column!) so we can measure the error between our predictions and the actual next word. Those "error" measurements are also represented as columns of 4096 numbers (because we measure the error in every dimension).
So we take that error vector and pass it backward through the whole system! Each layer needs to take the back-propagated error matrix and perform tiny adjustments to its Query, Key, and Value matrixes. Having compensated for those errors, it reverses its calculations based on the new QKV, and passes the resultant matrix backward to the previous layer. So we make tiny corrections on all 96 layers, and eventually to the word-vectors in the dictionary itself!
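For the curious, here is what one training step looks like in (for example) PyTorch, with a hypothetical `model` mapping token ids to next-token logits; the per-layer QKV adjustments described above all happen inside loss.backward() and optimizer.step():

    import torch
    import torch.nn.functional as F

    def train_step(model, optimizer, tokens):        # tokens: 1-D LongTensor of token ids
        inputs, targets = tokens[:-1], tokens[1:]    # predict each next token from the ones before it
        logits = model(inputs)                       # feed-forward pass
        loss = F.cross_entropy(logits, targets)      # measure the error against the known next tokens
        optimizer.zero_grad()
        loss.backward()                              # back-propagation: errors flow backward through all layers
        optimizer.step()                             # tiny corrections to the QKV matrices and embeddings
        return loss.item()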
Like I said earlier, we don't know exactly what those layers are doing. But we know that they're performing a hierarchical decomposition of concepts.
Hope that helps!
I can't ELI5 but I can ELI-junior-dev. Tl;dw:
Transformers work by basically being a differentiable lookup/hash table. First your input is tokenized and (N) tokens (this constitutes the attention frame) are encoded both based on token identity and position in the attention frame.
Then there is an NxN matrix that is applied to your attention frame, "performing the lookup query" over all the other tokens in the attention frame, so every token gets a "contextual semantic understanding" that takes in both all the other stuff in the attention frame and its relative position.
GPT is impressive because the N is really huge and it has many layers. A big N means you can potentially access information farther away. Each layer gives more opportunities to summarize and integrate long-range information in a fractal process.
Two key takeaways:
- differentiable hash tables
- encoding relative position using periodic functions
NB: the attention frame tokens are actually K-vectors (so the frame is a KxN matrix) and the query matrix is an NxNxK tensor IIRC but it's easier to describe it this way
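A tiny numpy sketch of the "differentiable hash table" idea, with made-up sizes: instead of an exact key match, every stored key contributes in proportion to how well it matches the query.

    import numpy as np

    def softmax(x):
        e = np.exp(x - x.max())
        return e / e.sum()

    keys   = np.random.randn(10, 64)   # 10 stored keys, 64 numbers each
    values = np.random.randn(10, 64)   # the payload stored under each key
    query  = np.random.randn(64)

    weights = softmax(keys @ query)    # soft match scores that sum to 1
    result  = weights @ values         # a blend of values: a "soft" lookup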
An attention mechanism is when you want a neural network to learn the function of how much attention to allocate to each item in a sequence, to learn which items should be looked at.
Transformers is a self-attention mechanism, where you ask the neural network to 'transform' each element by looking at its potential combination with every other element and using this (learnable, trainable) attention function to decide which combination(s) to apply.
And it turns out that this very general mechanism, although compute-intensive (it considers everything linking with everything, so complexity is quadratic in sequence length) and data-intensive (it has lots and lots of parameters, so it needs huge amounts of data to be useful), can actually represent many of the things we care about in a manner which can be trained with the deep learning algorithms we already had.
And, really, those are the two big things ML needs: a model structure where there exists some configuration of parameters which can actually represent the thing you want to calculate, and that this configuration can actually be determined from the training data in a reasonable way.
“After I woke up and made breakfast, I drank a glass of …”
In America one might say the most likely next words are “orange juice”, or “apple juice” but not “sports car” which has nothing to do with the sentence.
Ultimately this is what language models do, given a sequence of data (in this case words) predict the most likely next word(s).
For attention, when you read the sentence, which words stood out as more important? Probably "woke up", "breakfast", and "glass", while the words "after", "I", and "made" were less important to completing the sentence.
That is, you paid more attention to the important words to understand how to complete the sentence.
The “attention mechanism” in language models is a way to let the models learn which words are important in sentences and pay more attention to them too when completing sentences, just like a person would do as in the example above.
Further, it turns out this attention mechanism lets the models do lots of interesting things even without other fancy model techniques. That is “attention is all you need”.
And for a deeper dive, Andrej Karpathy has this hands-on video[1] where he builds a transformer from scratch. You can check out his other videos on NLP as well; they are all excellent.
[1] https://youtu.be/rURRYI66E54, https://youtu.be/89A4jGvaaKk
In the beginning, there was the matrix multiply. A simple neural network is a chain of matrix multiplies. Let's say you have your data A1 and weights W1 in a matrix. You produce A2 as A1xW1. Then you produce A3 as A2xW2, and so on. There are other operations in there, like non-linearities (so that you can actually learn something interesting) and fancy batch norms, but let's forget about those for now.

The problem with this is, it's not very expressive. Let's say your A1 matrix has just 2 values, and you want the output to be their product. Can you learn a weight matrix that performs multiplication of these inputs? No you can't. Multiplication must be simulated by piecing together piecewise linear functions.

To perform multiplication, the weight matrix W would also need to be produced by the network. Transformers do basically that. In the product A*W you replace A with (AxW1), W with (AxW2), and multiply those together (transposing one factor so the shapes line up): (AxW1)x(AxW2). And then do it once more for good measure: (AxW1)x(AxW2)x(AxW3). Boom, Nobel prize. Now your network can multiply, not just add.

OK, it's actually a bit more complicated; there is for example a softmax in the middle to perform normalisation, which in general helps during numerical optimisation: softmax((AxW1)x(AxW2))x(AxW3). There are then fancy explanations that try to retrospectively justify this as a "differentiable lookup table" or somesuch nonsense, calling the 3 parts "key", "query" and "value", which help make your paper more popular. But the basic idea is not so complicated.

A Transformer then uses this operation as a building block (running them in parallel and in sequence) to build giant networks that can do really cool things. Maybe you can teach networks to divide next and then you get the next Nobel prize.
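For what it's worth, the whole idea fits in a few lines of numpy (shapes are arbitrary, and one factor is transposed so the matrix product works out):

    import numpy as np

    def softmax(x, axis=-1):
        e = np.exp(x - x.max(axis=axis, keepdims=True))
        return e / e.sum(axis=axis, keepdims=True)

    n, d = 4, 8                                       # 4 inputs, 8 features each
    A = np.random.randn(n, d)
    W1, W2, W3 = (np.random.randn(d, d) for _ in range(3))

    out = softmax((A @ W1) @ (A @ W2).T) @ (A @ W3)   # softmax((AxW1)x(AxW2))x(AxW3)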
First, convert the input text to a sequence of token numbers (2048 tokens with 50257 possible token values in GPT-3) by using a dictionary and for each token, create a vector with 1 at the token index and 0 elsewhere, transform it with a learned "embedding" matrix (50257x12288 in GPT-3) and sum it with a vector of sine and cosine functions with several different periodicities.
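Here is that embedding step in numpy with much smaller made-up sizes (GPT-3's are 50257x12288); note that multiplying a one-hot vector by the embedding matrix is the same as just picking out a row:

    import numpy as np

    vocab, d_model, seq_len = 1000, 64, 16
    W_embed = np.random.randn(vocab, d_model)            # learned embedding matrix

    def positional_encoding(seq_len, d_model):
        pos = np.arange(seq_len)[:, None]
        i = np.arange(d_model // 2)[None, :]
        angles = pos / (10000 ** (2 * i / d_model))      # several different periodicities
        pe = np.zeros((seq_len, d_model))
        pe[:, 0::2] = np.sin(angles)
        pe[:, 1::2] = np.cos(angles)
        return pe

    token_ids = np.random.randint(0, vocab, size=seq_len)
    x = W_embed[token_ids] + positional_encoding(seq_len, d_model)   # one row per token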
Then, for each layer, and each attention head (96 layers and 96 heads per layer in GPT-3), transform the input vector by query, key and value matrices (12288x128 in GPT-3) to obtain a query, key and value vector for each token. Then for each token, compute the dot product of its query vector with the key vectors of all previous tokens, scale by 1/sqrt of the vector dimension and normalize the results so they sum to 1 by using softmax (i.e. applying e^x and dividing by the sum), giving the attention coefficients; then, compute the attention head output by summing the value vectors of previous tokens weighted by the attention coefficients. Now, for each token, glue the outputs for all attention heads in the layer (each with its own key/query/value learned matrices), add the input and normalize (normalizing means that the vector values are biased and scaled so they have mean 0 and variance 1).
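And here is one attention head of that step in numpy, again with small made-up sizes (GPT-3's query/key/value matrices are 12288x128); each token only attends to itself and the tokens before it:

    import numpy as np

    def softmax(x, axis=-1):
        e = np.exp(x - x.max(axis=axis, keepdims=True))
        return e / e.sum(axis=axis, keepdims=True)

    d_model, d_head, seq_len = 64, 16, 16
    Wq, Wk, Wv = (np.random.randn(d_model, d_head) for _ in range(3))
    x = np.random.randn(seq_len, d_model)                # output of the embedding step

    Q, K, V = x @ Wq, x @ Wk, x @ Wv                     # query, key, value vectors per token
    scores = Q @ K.T / np.sqrt(d_head)                   # dot products, scaled by 1/sqrt(dim)
    mask = np.tril(np.ones((seq_len, seq_len)))          # only previous tokens are visible
    scores = np.where(mask == 1, scores, -1e9)
    attn = softmax(scores, axis=-1)                      # attention coefficients, each row sums to 1
    head_out = attn @ V                                  # weighted sum of the value vectors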
Next, for the feedforward layer, apply a learned matrix, add a learned vector and apply a ReLU (which is f(x) = x for positive x and f(x) = 0 for negative x), then apply a second learned matrix and vector (the two matrices are 12288x49152 and 49152x12288 in GPT-3 and actually account for around 70% of its parameters), then add the input before the feedforward layer and normalize.
Repeat the process for each layer, each with their own matrices, passing the output of the previous layer as input. Finally, apply the transpose of the initial embedding matrix and use softmax to get probabilities for the next token at each position. For training, train the network so that these probabilities are close to the actual next token in the text. For inference, sample a next token from the top K tokens of the probability distribution (above a cutoff) and repeat the whole thing to generate tokens until an end-of-text token is generated.
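The last sampling step might look roughly like this (temperature and nucleus sampling are common variations not shown):

    import numpy as np

    def sample_top_k(probs, k=40, rng=np.random.default_rng()):
        top = np.argsort(probs)[-k:]           # indices of the k most probable tokens
        p = probs[top] / probs[top].sum()      # renormalize over that shortlist
        return rng.choice(top, p=p)            # draw the next token id

    # probs would be the softmax output for the last position (length = vocabulary size)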
A transformer is a type of neural network that, like many networks before, is composed of two parts: the "encoder" that receives a text and builds an internal representation of what the text "means"[1], and the "decoder" that uses the internal representation built by the encoder to generate an output text. Let's say you want to translate the sentence "The train is arriving" to Spanish.
Both the encoder and decoder are built like Lego, with identical layers stacked on top of each other. The lowest level of the encoder looks at the input text and identifies the role of individual words and how they interact with each other. This is passed to the layer above, which does the same but at a higher level. In our example it would be as if the first layer identified that "train" and "arrive" are important, then the second one identifies that "the train" and "is arriving" are core concepts, the third one links both concepts together, and so on.
All of these internal representations are then passed to the decoder (all of them, not just the last ones), which uses them to generate a single word, in this case "El". This word is then fed back to the decoder, which now needs to generate an appropriate continuation for "El", which in this case would be "tren". You repeat this procedure over and over until the transformer says "I'm done", hopefully having generated "El tren está llegando" in the process.
The attention mechanism already existed before transformers, typically coupled with an RNN. The key concept of the transformer was building an architecture that removed the RNN completely. The negative side is that it is a computationally inefficient architecture, as there are plenty of O(n^2) operations in the length of the input [2]. Luckily for us, a bunch of companies started releasing for free giant models trained on lots of data, researchers learned how to "fine tune" them to specific tasks using way less data than what it would have taken to train from scratch, and transformers exploded in popularity.
[1] I use "mean" in quotes here because the transformer can only learn from word co-occurrences. It knows that "grass" and "green" go well together, but it doesn't have the data to properly say why. The paper "Climbing towards NLU" is a nice read if you care about the topic, but be aware that some people disagree with this point of view.
[2] The transformer is less efficient than an LSTM in the total number of operations but, simultaneously, it is easier to parallelize. If you are Google, this is the kind of problem you can easily solve by throwing a data center or two at it.
https://peterbloem.nl/blog/transformers
https://e2eml.school/transformers.html
I would also add Luis Serrano's article here: https://txt.cohere.com/what-are-transformer-models/ (HN discussion: https://news.ycombinator.com/item?id=35576918).
Looking back at The Illustrated Transformer, when I introduce people to the topic now, I find I can hide some complexity by omitting the encoder-decoder architecture and focusing only on one. Decoders are great because now a lot of people come to Transformers having heard of GPT models (which are decoder only). So for me, my canonical intro to Transformers now only touches on a decoder model. You can see this narrative here: https://www.youtube.com/watch?v=MQnJZuBGmSQ
- This understanding can be encapsulated in a "compressed", low-dimensional vector representation of a sequence.
- You can use this understanding for many different downstream tasks, especially predicting the next item in a sequence.
- This approach scales really well with lots of GPUs and data and is super applicable to generating text.
In LLMs, this means go from prompt to answer. I'll cover inference only, not training.
I can't quite ELI5, but process is roughly:
- Write a prompt
- Convert each token in the prompt (roughly a word) into numbers. So "the" might map to the number 45.
- Get a vector representation of each word - go from 45 to [.1, -1, -2, ...]. These vector representations are how a transformer understands words.
- Combine vectors into a matrix, so the transformer can "see" the whole prompt at once.
- Repeat the following several times (once for each layer):
- Multiply the vectors by the other vectors. This is attention - it's the magic of transformers, that enables combining information from multiple tokens together. This generates a new matrix.
- Feed the matrix into a linear regression. Basically multiply each number in each vector by another number, then add them all together. This will generate a new matrix, but with "projected" values.
- Apply a nonlinear transformation like relu. This helps model more complex functions (like text input -> output!)
Note that I really oversimplified the last few steps and the ordering (a rough sketch of one simplified layer follows below). At the end, you'll have a matrix. You then convert this back into numbers, then into text.
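A minimal numpy sketch of one such layer, with made-up shapes; real layers use multiple attention heads plus residual connections and normalization, which the steps above deliberately leave out:

    import numpy as np

    def softmax(x, axis=-1):
        e = np.exp(x - x.max(axis=axis, keepdims=True))
        return e / e.sum(axis=axis, keepdims=True)

    seq_len, d_model, d_hidden = 16, 64, 256
    X = np.random.randn(seq_len, d_model)                 # the prompt as a matrix of vectors

    # "multiply the vectors by the other vectors" (attention, heavily simplified)
    mixed = softmax(X @ X.T / np.sqrt(d_model)) @ X

    # "feed the matrix into a linear regression", then a nonlinearity like relu
    W1 = np.random.randn(d_model, d_hidden)
    W2 = np.random.randn(d_hidden, d_model)
    out = np.maximum(0, mixed @ W1) @ W2                  # projected values, back at the original width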
Transformers, Explained: Understand the Model Behind GPT-3, BERT, and T5: https://daleonai.com/transformers-explained
Transformers from Scratch: https://e2eml.school/transformers.html
The first link covers Attention well.
Hope they will do the same for you ;-)
Large Language Models from scratch https://www.youtube.com/watch?v=lnA9DMvHtfI
Large Language Models: Part 2 https://www.youtube.com/watch?v=YDiSFS-yHwk
Basically tokens "talk" to each other and say: this is what I have, and this is what I'm looking for.
The novelty in this paper is this "query-key-value" relation that gets learned. A lot of previous work in this area was focused on learning a rough state machine to which you input a set of state transitions and it will give you the most likely next state. This will also work, but training such networks is very slow and you also don't have the capability to train the network to "attend" to certain parts of the input. This lookup-based technique lets you do that, plus it is also very compute-efficient (compared to previous techniques).
I'm missing a lot of details but that's basically the intuition behind this.
These are very excellent resources: - https://www.youtube.com/watch?v=ptuGllU5SQQ&list=PLoROMvodv4... - https://www.youtube.com/watch?v=OyFJWRnt_AY&pp=ygUfYXR0ZW50a...
> LoRA makes LLMs composable, piecewise, mathematically, so that if there are 10,000 LLMs in the wild, they will all eventually converge on having the same knowledge. This is what Geoffrey Hinton was referring to on his SkyNet tour.
I don't think that's right at all. LoRA freezes most of the large model and wouldn't let you simply combine large models. Instead, I'm pretty sure Hinton is referring to data-parallel training with batching:
> DataParallel (DP) - the same setup is replicated multiple times, and each being fed a slice of the data. The processing is done in parallel and all setups are synchronized at the end of each training step.
https://huggingface.co/docs/transformers/v4.15.0/parallelism
You can have many instances of the model training on different bits of data, and then just average the modified weights back together at the end. This combining of weights is what Hinton means when he says parallel copies of brains can learn things independently and then recombine later at huge bandwidth, whereas humans are far more limited to sharing separate experiences verbally or with, like, a multimedia presentation or something.
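A toy illustration of that recombination step; real data-parallel training typically averages gradients every step rather than weights at the end, and the parameter dict here is hypothetical:

    import numpy as np

    def average_weights(model_a, model_b):
        # model_a, model_b: dicts mapping parameter names to numpy arrays of
        # the same shapes (two copies of the same architecture, trained apart)
        return {name: (model_a[name] + model_b[name]) / 2 for name in model_a}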
Imagine you have a box of toys. Some toys are more interesting to you than others, so you pay more attention to those toys and play with them more. The same thing happens in the "Attention is All You Need" paper, but instead of toys, we have words in a sentence.
Before this paper, when computers tried to translate one language to another, they would look at one word, translate it, then move to the next word. This works okay, but it's not great because sometimes the meaning of a word can depend on other words in the sentence.
The clever thing about the "Attention is All You Need" paper is that it taught the computer to pay attention to all the words in a sentence at the same time, but give more importance ("attention") to the words that matter most for understanding and translation.
This new way of translating languages using "attention" made computers much better at understanding and translating languages. It's like if you could become better at playing by understanding all your toys at once! This paper was a big deal in the field of machine learning and artificial intelligence because it improved how machines understand languages.
The main useful idea from ML is just that we could learn meaning directly from the data—and we have a lot of data thanks to the internet. So there was a lot of work that went into ways to learn the meaning of every word and the relationships between words directly from text data—with some impressive successes.
But in almost all cases one of the biggest problems was learning how words affect each other when they’re far apart. That was a really hard problem because if you want to know how any two words affect each other then there are a lot of pairs you need to try. If you have 10 words in a sentence then there’s about 100 pairs; and if you have 1000 words then you have about 1 million pairs. For many years it seemed silly to even try that; computers are fast, but not _that_ fast…right?
But eventually hardware got powerful enough that someone decided to throw away all the cleverness and complexity; instead they just did the most obvious thing: try _every_ pair of words. When you get down to it, that's really all that attention is: just test every pair of inputs to see how similar they are.
The title of the paper “Attention is All You Need” highlights that you can get rid of all the other tricks that people had been inventing to work around this problem of relating words that are far apart and learning the right meaning from the data. You don’t need to remember earlier words, you don’t need a fixed size context window or dynamic context or pre-trained word vectors or many, many other ideas. You _just_ need Attention to learn what words mean and solve the long distance problem.
Now, it didn’t _really_ solve the problem because the original transformer could only handle around 500 tokens. This is what folks mean when they talk about the “context length” or “context window” of a Transformer model. And it’s why everyone has been so surprised and impressed when the context window for GPT jumped to 2,000 (that’s 16x more memory than the original transformer), and now we see models with 30k or 100k context windows.
In any case, at this point we’ve learned that the (relatively) simple idea of Transformers is actually incredibly powerful—and remarkably general-purpose. There are actually only a handful of ideas in today's Transformer-based models that weren't in the original paper.
Not sure how well the slides can be understood by themselves, though I tried to be accommodating for that
Here is a very informative video and blog by the Google Cloud Tech team explaining the game-changing nature of the self-attention mechanism proposed by the paper, which lets the model understand the context of the words used in a sentence [1],[2].
[1] Transformers, explained: Understand the model behind GPT, BERT, and T5:
[2] Corresponding blog post:
Please suggest some paper that delves a bit more into the theory around the architecture.
However, most people gloss over other aspects of the "Attention is all you need" paper, which is in a sense mis-titled.
For example, Andrej Karpathy pointed out that the paper had another significant improvement hidden in it: during training the gradients can take a "shortcut" so that the bottom layers are trained faster than in typical deep learning architectures. This enables very large and deep models to be trained in a reasonable time. Without this trick, the huge LLMs seen these days would not have been possible!
Andrej talks about it here: https://youtu.be/9uw3F6rndnA?t=238
You will need to watch videos.
Watch this playlist and you will understand: https://youtube.com/playlist?list=PLaJCKi8Nk1hwaMUYxJMiM3jTB...
Then watch this and you will understand even more: https://youtu.be/g2BRIuln4uc
Finally, watch this playlist: https://youtube.com/playlist?list=PL86uXYUJ7999zE8u2-97i4KG_...
If you're a programmer, start with Karpathy's video series. For a somewhat gentler intro, take a look at the MIT intro lectures first to build up on the fundamentals.
Then you're ready for The Illustrated Transformer, and afterwards, if you're brave enough, the Annotated Transformer.
It's a fascinating subject, more so when you have a solid grasp! And you'll be able to quickly spot people who kinda stumble their way through but have big gaps in understanding.
I worked on a few projects that were trying to develop foundation models for health care, aviation, and other domains. In particular I trained an LSTM model to write fake abstracts for clinical case reports.
We ran into many problems, but maybe the worst one with the LSTM is that a real document repeats itself a lot. For instance, somebody's name might turn up multiple times and the LSTM was very bad at that kind of copying. The LSTM community was arguing about solutions to this problem, but the attention mechanism in transformers makes it easy.
In the context of seq2seq models, attention is a technique to compute a weighted average of hidden states of the encoder. When I first realized this simple fact, everything finally clicked.
In contrast to a vanilla seq2seq, which takes only the encoder's hidden state at the last timestep as the context vector, the context vector of a seq2seq with attention is a weighted average of all hidden states of every timestep. The weight of a hidden state of the encoder is a similarity score between that hidden state and the decoder's previous output (the decoder's current state). The similarity function can be as simple as a dot product, but there are various ways to do it.
Attention can improve a seq2seq model because now the encoder's last hidden state doesn't need to represent the whole input sequence well, which is hard if the input is long — now, at every timestep, the decoder takes all the encoder's hidden states and computes an average of them with weights that differ from those of the other timesteps. The weights at a timestep represent which input words are more important, and thus which to focus on, when the decoder is to output a word at that timestep.
More generally, attention takes a set of value vectors and a query vector and computes a weighted average (or more generally, a weighted sum, if the weights don't sum up to 1) of the values based on the query. In the context of seq2seq models, the values are the encoder's hidden states and the query is the decoder's previous output.
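In numpy, with made-up sizes and a plain dot product as the similarity function, the context-vector computation described above looks something like:

    import numpy as np

    def softmax(x):
        e = np.exp(x - x.max())
        return e / e.sum()

    encoder_states = np.random.randn(10, 128)    # one hidden state per input timestep
    decoder_state  = np.random.randn(128)        # the decoder's previous output / current state

    weights = softmax(encoder_states @ decoder_state)   # one similarity score per timestep
    context = weights @ encoder_states                  # weighted average = the context vector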
Transformer is a patterning probabilistic machine for a sequence of identities[1]. These identities are fed to the transformer in lanes. The transformer is conditioned to shift lanes one position to the left until they make it to the output, and to make a prediction in the right-most lane that got freed up.

Attention adds an exponential amount of layer interconnectivity compared with simple densely connected layers. The attention mask serves as a high-dimensional dropout, without which it would be extremely easy for the Transformer to simply repeat the inputs (and then fail to generalize when making the prediction).

Each layer up until the vertical middle of the Transformer works with a higher contextual representation than the previous one, and this is again unwound back to lower contexts from the middle layer to the original identities (integers) on the outputs. This means that you have raw identities on the input and output which span a certain width/window of the input sequence, but in comparison the middle-most layer has a sequence of high-level contexts spanning extreme lengths of the original input sequence, knowledge-wise.

[1] It's important to know that modification (learning) by the Transformer of the vector embeddings which represent the input/output identities/integers constitutes a big portion of the Transformer's power. The practical implication is that it's impractical to try to tell the Transformer that e.g. some of our identities are similar, or that there's some logical system in their similarity, because all the Transformer really cares about is the occurrence of these identities in the sequence we train it on, and it will figure out the similarities or any kind of logic in the sequence by itself.
Attention: y = W(x)·x
W is a matrix; x and y are vectors. Compare an ordinary linear layer, y = W·x: in the attention case, W is itself a function of the input.
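The contrast in a few lines of numpy (a hedged sketch: the softmax and scaling that real attention adds are left out):

    import numpy as np

    d, n = 8, 4
    x = np.random.randn(n, d)                  # n inputs, d features each

    W_fixed = np.random.randn(d, d)
    y_linear = x @ W_fixed                     # y = W x: the same mixing for every input

    Wq, Wk = np.random.randn(d, d), np.random.randn(d, d)
    W_of_x = (x @ Wq) @ (x @ Wk).T             # an n x n mixing matrix computed from x itself
    y_attn = W_of_x @ x                        # y = W(x) x: the mixing depends on the input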
"Again: Calling this "attention" at best a joke."
http://bactra.org/notebooks/nn-attention-and-transformers.ht...
The idea behind the Transformer is nice - but by far not Nobel prize deserving.
Don't believe the hype or people like Yegge, whoever that is - in a few years a new architecture will be the "Nobel candidate".
Also, the original Transformer paper, if you read is, is horribly written.
I have recently written a paper on this https://arxiv.org/abs/2302.01834
I also have a discord channel https://discord.cofunctional.ai.
Ironically, it's the same mechanism as what renormalization in QFT does. I'm getting increasingly convinced that it's also how the brain works.
The key part is the attention mechanism, as the title of the paper may have spoiled. It works more or less like this:
- Start with an input sequence X1, X2 ... Xn. These are all vectors.
- Map the input sequence X into 3 new sequences of vectors: query (Q), key (K), and value(V), all of the same length as the input X. This is done using learnable mappings for each of the sequences (so one for X->Q, another for X->K and one for X->V).
- Compare the similarity of every query with every key. This gives you a weight for each query/key pair. Call them W(Q1, K2) and so forth.
- Compute the output Z as the sum of every _value_ weighted by the weight for the respective query/key pair (so Z1 = V1·W(Q1,K1) + V2·W(Q1,K2) + ... + Vn·W(Q1,Kn), Z2 = V1·W(Q2,K1) + V2·W(Q2,K2) + ..., and so on)
- and that's about it!
As throwawaymaths mentions, this is quite similar to a learnable hash table with the notable difference that the value fetched is also changed, so that it doesn't fetch "input at an index like i" but "whatever is important at an index like i".
Now a few implementation details on top of this:
- The description is for a single "attention head". Normally several, each with their own mappings for Q/K/V, are used, so the transformer can look at different "things" simultaneously. 8 attention heads seems pretty common.
- The description doesn't take the position in the sequence into account (W(Q1,K1) and W(Q1,Kn) are treated perfectly equally). To account for ordering, "positional encoding" is normally used. Usually this is just adding a bunch of scaled sine/cosine waves to the input. Works surprisingly well.
- The transformer architecture has a number of these "attention layers" stacked one after the other and also 2 different stacks (encoder, decoder). The paper is about machine translation, so the encoder is for the input text and the decoder for the output. Attention layers work just fine in other configurations as well.
The rest of the architecture is fairly standard stuff
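To make the multi-head detail above concrete, here is a hedged numpy sketch (positional encoding and masking omitted, shapes made up):

    import numpy as np

    def softmax(x, axis=-1):
        e = np.exp(x - x.max(axis=axis, keepdims=True))
        return e / e.sum(axis=axis, keepdims=True)

    def attention_head(X, Wq, Wk, Wv):
        Q, K, V = X @ Wq, X @ Wk, X @ Wv                        # map input to query/key/value
        W = softmax(Q @ K.T / np.sqrt(Q.shape[-1]), axis=-1)    # weight per query/key pair
        return W @ V                                            # values weighted by those weights

    n, d, heads = 6, 64, 8
    d_head = d // heads
    X = np.random.randn(n, d)
    params = [tuple(np.random.randn(d, d_head) for _ in range(3)) for _ in range(heads)]
    Z = np.concatenate([attention_head(X, *p) for p in params], axis=-1)   # glue heads: back to n x d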
In this case, AGI proponents are using words which are highly loaded to mean "is a thinking, reasoning being" in some way.
I don't like it. I would prefer that they'd chosen words which were more neutral and not based on illusions of intelligence, or allusions to known intelligent behaviour.
"attention" is a thing, sure. But, if you use this word in formal session presenting to Congress, you're misleading them without conscious effort to believe you think "it's alive"
I don't like it. I think in hindsight calling the field AI was a huge mistake.
If you want something to hang on this, think about legal english and words like "real property" -do you really know what a solicitor or lawyer or barrister or judge means when they say that? or "without let or hindrance" -what does the word "let" mean there?
Within legal contexts, using the jargon is a given. Misusing it outside the courtroom as a non-legal practitioner is a recipe for disaster. This is where "Sovereign Citizens" are playing: look how well that's going.
https://www.youtube.com/watch?v=S27pHKBEp30
It's in the context of NLP, which is where transformers started of course.
We're going to represent words as vectors (a sequence of numbers). We would like it to be the case that the value of the numbers reflects the meaning of the words. Words that mean similar things should be near each other. We also want to represent higher level ideas, ideas that take multiple words to express, in the same way. You can think of all the possible vectors as the entire space of ideas.
To begin with, though, we just have a vector for each word. This is insufficient - does the word "bank" mean the edge of a river or a place to store money? Is it a noun or a verb? In order to figure out the correct vector for a particular instance of this word, we need to take into account its context.
A natural idea might be to look at the words next to it. This works okay, but it's not the best. In the sentence "I needed some money so I got in my car and took a drive down to the bank", the word that really tells me the most about "bank" is "money", even though it's far away in the sentence. What I really want is to find informative words based on their meaning.
This is what transformers and attention are for. The process works like this: For each word, I compose a "query" - in hand-wavy terms, this says "I'm looking for any other words out there that are X". X could be "related to money" or "near the end of the sentence" or "are adjectives". Next, for each word I also compute a "key", this is the counterpart of the query, and says "I have Y". For each query, I compare it to all the keys, and find which ones are most similar. This tells me which words (queries) should pay attention to which other words (keys). Finally, for each word I compute a "value". Whereas the "key" was sort of an advertisement saying what sort of information the word has, the "value" is the information itself. Under the hood, the "query", "key" and "value" are all just vectors. A query and a key match if their vectors are similar.
So, as an example, suppose that my sentence is "Steve has a green thumb". We want to understand the meaning of the word "thumb". Perhaps a useful step for understanding any noun would be to look for adjectives that modify it. We compute a "query" that says "I'm looking for words near the end of the sentence that are adjectives". When computing a "key" for the word green, maybe we compute "I'm near the end of the sentence, I'm a color, I'm an adjective or a noun". These match pretty well, so "thumb" attends to "green". We then compute a "value" for "green" that communicates its meaning.
By combining the information we got from the word "green" with the information for the word "thumb", we can have a better understanding of what it means in this particular sentence. If we repeat this process many times, we can build up stronger understanding of the whole sentence. We could also have a special empty word at the end that represents "what might come next?", and use that to generate more text.
But how did we know which queries, keys and values to compute? How did we know how to represent a word's meaning as numbers at all? These seemingly impossible questions are what is being "learned". How exactly that happens would require an equally big explanation of its own.
Keep in mind that this explanation is very fuzzy, and is only intended to convey the loose intuition of what is going on. It leaves out many technical details and even gets some details intentionally wrong to avoid confusion.
Attention(Q, K, V) = softmax( (Q * TRANSPOSED(K)) / sqrt(Dk) ) * V
That's where I start to shake my head.
I remain disappointed at the staggeringly low quality of academic work. The writing here is appalling. No worked example provided, despite evidence of one... typical academic crap.
Wouldn't surprise me if you go to try it and it's wrong, and none of the real problems have been solved.
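For what it's worth, a tiny worked instance of the formula above (two tokens, Dk = 2, numbers made up):

    import numpy as np

    Q = np.array([[1.0, 0.0],
                  [0.0, 1.0]])
    K = np.array([[1.0, 0.0],
                  [1.0, 1.0]])
    V = np.array([[10.0, 0.0],
                  [ 0.0, 5.0]])

    scores = Q @ K.T / np.sqrt(2)       # [[0.71, 0.71], [0.0, 0.71]]
    weights = np.exp(scores) / np.exp(scores).sum(axis=-1, keepdims=True)
    out = weights @ V
    print(weights)   # row 0: [0.5, 0.5]; row 1: roughly [0.33, 0.67]
    print(out)       # each output row is a weighted mix of the rows of V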
This may make it difficult to explain and I already see many incorrect explanations here and even more lazy ones (why post the first Google result? You're just adding noise)
> Steve Yegge on Medium thinks that the team behind Transformers deserves a Nobel
First, Yegge needs to be able to tell me what Attention and Transformers are. More importantly, he needs to tell me who invented them.
That actually gets to an important point, and why there are so many bad answers here and elsewhere: you're both missing a lot of context and working with murky definitions. This is also what makes it difficult to ELI5. I'll try, then try to give you resources to get an actually good answer.
== Bad Answer (ELI5) ==
A transformer is an algorithm that considers the relationship of all parts of a piece of data. It does this through 4 mechanisms, arranged in two parts. The first part is composed of a normalization block and an attention block. The normalization block scales the data and ensures that it is not too large. Then the attention mechanism takes all the data handed to it and considers how it is all related. This is called "self-attention" when we only consider one input, and "cross-attention" when we have multiple inputs and compare them. Both of these create relationships that are similar to a lookup table. The second part is also composed of a normalization block, followed by a linear layer. The linear layer reprocesses all the relationships it just learned and gives them context. But we haven't stated the 4th mechanism! This is called a residual or "skip" layer. It allows the data to pass right on by each of the above parts without being processed, and this little side path is key to getting things to train efficiently.
Now that doesn't really do the work justice or give a good explanation of why or how things actually work. ELI5 isn't a good way to understand things for usage, but it is an okay place to start and learn abstract concepts. For the next level up I suggest Training Compact Transformers[0]. It'll give some illustrations and code to help you follow along. It is focused on vision transformers, but it is all the same. For the next level I suggest Karpathy's video on GPT[1], where you will build transformers and he goes in a bit more depth. Both these are good for novices and people with little mathematical background. For more lore and understanding of why we got here and the confusion over the definition of attention, I suggest Lilian Weng's blog[2] (everything she does is gold). For a lecture and more depth I suggest Pascal Poupart's class. Lecture 19[3] is the one on attention and transformers, but you need to at minimum watch Lecture 18, and if you actually have no ML experience or knowledge then you should probably start from the beginning.
The truth is that not everything can be explained in simple terms, at least not if one wants an adequate understanding. That misquotation of Einstein (probably originating from Nelson) is far from accurate and I wouldn't expect someone that introduced a highly abstract concept with complex mathematics (to such a degree that physicists argued he was a mathematician) would say something so silly. There is a lot lost when distilling a concept and neither the listener nor speaker should fool themselves into believing this makes them knowledgeable (armchair expertise is a frustrating point on the internet and has gotten our society in a lot of trouble).
[0] https://medium.com/pytorch/training-compact-transformers-fro...
[1] https://www.youtube.com/watch?v=kCc8FmEb1nY
[2] https://lilianweng.github.io/posts/2018-06-24-attention/
That is the real question.
Pro tip: if you want technical info on research-related topics, don't ask HN. Tech bros can't handle telling themselves they don't know something, so everyone will give their "take" on the subject at hand.
The "Attention is All You Need" paper, published in 2017 by Vaswani et al., introduced the Transformer model architecture. The paper proposed a new way to process sequences of data, such as words in a sentence or time steps in a time series, without using recurrent neural networks (RNNs) or convolutional neural networks (CNNs). Instead, it introduced a mechanism called "self-attention."
Self-attention allows the model to weigh the importance of different words in a sentence when processing each word. This attention mechanism helps the model to focus on the relevant parts of the input sequence. In other words, it pays attention to different words based on their contextual significance for a given task.
To understand self-attention, let's consider an example. Suppose we have a sentence: "The cat sat on the mat." When processing the word "sat," self-attention enables the model to assign higher weights to words like "cat" and "the" and lower weights to words like "on" and "the mat." This way, the model can learn which words are more relevant to understanding the context of "sat."
The Transformer model consists of an encoder and a decoder. The encoder processes the input sequence, such as a sentence, while the decoder generates the output sequence, such as a translated sentence. Both the encoder and decoder are composed of multiple layers of self-attention and feed-forward neural networks. The self-attention layers allow the model to capture dependencies between different words in the sequence, while the feed-forward networks help in learning more complex patterns.
The "Attention is All You Need" paper demonstrated that Transformers achieved state-of-the-art performance on machine translation tasks while being more parallelizable and requiring less training time compared to traditional RNN-based models. Since then, Transformers have become the go-to architecture for many NLP tasks and have been further improved with variations like BERT, GPT, and T5.
In summary, the Transformer model introduced in the "Attention is All You Need" paper replaced traditional recurrent or convolutional neural networks with self-attention, allowing the model to capture contextual relationships between words more effectively. This innovation has had a significant impact on the field of NLP and has become the foundation for many subsequent advances in the field.
Let's say that you are given a very good embedding of each English word as a vector of numbers. The idea of embeddings is that each dimension captures a different characteristic of the word. So for example, dimension 37 might capture gender and dimension 56 might capture how royal the word is. So "king" and "queen" will have very different scores in dimension 37 but both words will have a high score in dimension 56. These embeddings have been available for many years, eg word2vec.
The challenge is this: given a sentence with many words, how can you best encode the meaning of the sentence in a vector? The simplest approach is to take the embeddings for all the words and average them together to get a summary vector. This is a reasonable approach, and will work fine for simple tasks like assigning a positive or negative sentiment to the sentence. For example, it will do a good job of separating “I love this amazing product” and “I hate this terrible product”. This approach is analogous to the “bag of words” model.
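That averaging baseline in a few lines of numpy; the embedding vectors here are random placeholders standing in for something like word2vec:

    import numpy as np

    emb = {w: np.random.randn(50) for w in
           ["i", "love", "hate", "this", "amazing", "terrible", "product"]}

    def sentence_vector(sentence):
        return np.mean([emb[w] for w in sentence.lower().split()], axis=0)

    v_pos = sentence_vector("I love this amazing product")
    v_neg = sentence_vector("I hate this terrible product")
    # with real embeddings these differ mainly along a sentiment-like direction;
    # word order is ignored entirely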
This simple model is missing two big things. First, when interpreting the meaning of each word, it uses the original embedding of that word without any regard for the context around the word. So “bank” will be assigned the same meaning in the sentence “we got money from the bank” and “we sat by the river bank.” Second, the model does not take into account the ordering of the words, so that “the dog bit the man” and “the man bit the dog” will both get the same result.
Said another way, our simple model lacks the expressibility to distinguish meaningful differences between sentences. Transformers address these deficiencies by making the model more expressive, while keeping it computationally efficient and easy to train.
First, the transformer recognizes that we need to reinterpret each word based on the other words in the sentence. Each “layer” of the transformer can be seen as doing a reinterpretation of each word based on its context. Successive layers are applied to reach iteratively better reinterpretations.
In order to reinterpret the word “bank” in the sentence “we got money from the bank”, we first need to score all of the other words based on their relevance to “bank”. Obviously, “money” should get a higher relevance score than “from”. A natural approach to get a relevance score is to take the dot product of each other word’s embedding against the embedding for the word bank. (The dot product of two vectors is a common metric to gauge their similarity.)
However, this is not quite expressive enough. For example, in the sentence “the food tastes disgusting”, the meaning of the word “disgusting” is actually not very similar to the meaning of “food”, but clearly “disgusting” is very relevant to the interpretation of “food.” To take this into account and improve the expressiveness of the model, the idea is to maintain a separate set of embeddings for each word to be used in the relevance score dot product. These embeddings are called “keys”. So when reinterpreting the word “bank” in the sentence “we got money from the bank”, we grab the key embeddings for all the words, and dot product each one against the separate query embedding for the word “bank”. For example, multiplying the key for “money” against the query for “bank” will tell us how relevant the word “money” is for reinterpreting the word “bank.” Note that we need to separate key and query to break the symmetry of the dot product. In the phrase "Monopoly" "money", the word "Monopoly" significantly changes our interpretation of the word "money", but "money" does not significantly change our interpretation of "Monopoly."
Now that we have these relevance scores, we normalize them to sum to 1, and then we reinterpret “bank” as a relevance-weighted average of the value vectors of all of the other words. This is called the Attention mechanism, because when reinterpreting each word we selectively "pay attention" to the words that are most relevant to it.
There are a number of details omitted in this description, but hopefully it gives a general sense. The black magic of designing ML architectures is developing the right intuition for what is just expressive enough to capture meaningful relationships, while still being easy to compute and leveraging modern hardware.
It's a bit like deciding how many legs to put on a table. It's not so much that 4 legs is theoretically correct, but rather that 2 legs definitely doesn't work, 3 legs seems okay but feels a bit iffy if we put our weight in certain places and it's not too much more expensive to add a fourth leg anyway, and 5 legs definitely seems like overkill.
———————————————
Major omitted details:
- The word embeddings are not fixed, but learned from scratch as trainable parameters
- The query, key and value vectors for each word are actually the output of the input embedding times three different matrices Q, K and V. The reason for doing this is a bit complex. In order to have successive layers of reinterpretation, you cannot keep using the same query vector for each word in the subsequent layers because you have reinterpreted what it means. After the first layer, you no longer have the word "bank", you just have a reinterpreted vector of numbers so there is no way to do a lookup to get a query vector. Multiplying the new vector by three different learned matrices is a clever way to get around this.
- Positional information is encoded by adding a (learned) positional vector to the word embedding, so that the embedding for “bank” will look a little different if it is at the beginning of the sentence vs. the end of the sentence.
“It’s simple ! Just tokenize the context and allow differentiated hash maps to best map the embeddings matrix. Duh!”
Sure, I'll try to explain what a transformer is in the context of AI tools like ChatGPT, and what attention references in this context. Here is a five paragraph explanation:
*Paragraph 1: What is a transformer?* A transformer is a type of neural network that can process sequential data, such as natural language text, by learning the context and meaning of the data elements. A neural network is a computer program that can learn from data and perform tasks such as classification, regression, generation, etc. A transformer can take an input sequence, such as a sentence or a paragraph, and convert it into another sequence, such as a translation or a summary. A transformer can also generate sequences from scratch, such as writing a poem or a story.
*Paragraph 2: How does a transformer work?* A transformer works by using two main components: encoders and decoders. An encoder takes an input sequence and converts it into a vector representation called an encoding. A decoder takes an encoding and converts it back into an output sequence. Both the encoder and the decoder use a technique called attention to focus on the relevant parts of the input and output sequences. Attention is a way of measuring how much each element in a sequence influences or depends on another element.
*Paragraph 3: What is attention?* Attention is a mathematical technique that allows a transformer to learn the relationships between the elements in a sequence. For example, in a machine translation task, attention allows the transformer to translate words like 'it' into the correct gender in another language by looking at all the relevant words in the original sentence. Attention also allows the transformer to handle long sequences without losing information or context. There are different types of attention, such as self-attention, cross-attention, global attention, local attention, etc.
*Paragraph 4: Why are transformers important?* Transformers are important because they are among the most powerful and versatile types of neural networks invented to date. They can handle many types of sequential data, such as text, speech, images, videos, etc. They can perform many tasks, such as machine translation, text summarization, text generation, image captioning, speech recognition, etc. They can also learn from large amounts of data and make accurate predictions that drive their wider use.
*Paragraph 5: What are some examples of transformers?* Some examples of transformers are:
- ChatGPT: A transformer that can generate realistic and engaging conversations based on a given topic or context.
- BERT: A transformer that can understand natural language and answer questions or perform other tasks based on it.
- GPT-3: A transformer that can generate text for any purpose or domain based on a few words or sentences.
- DALL-E: A transformer that can generate images from text descriptions.
- AlphaStar: A transformer that can play the strategy game StarCraft at a superhuman level.
I hope this explanation helps you understand what a transformer is and what attention references in this context.
Source: Conversation with Bing, 5/17/2023 (1) What Is a Transformer Model? | NVIDIA Blogs. https://blogs.nvidia.com/blog/2022/03/25/what-is-a-transform.... (2) Transformer Neural Network Definition | DeepAI. https://deepai.org/machine-learning-glossary-and-terms/trans.... (3) Generative AI: AI Transformers. https://lablab.ai/blog/generative-ai-ai-transformers. (4) The Ultimate Guide to Transformer Deep Learning - Turing. https://www.turing.com/kb/brief-introduction-to-transformers.... (5) How Transformers Work. Transformers are a type of neural… | by Giuliano .... https://towardsdatascience.com/transformers-141e32e69591.
Transformer is a building block (a part) of a language model. "Language model" is an algorithm that can predict words following given words. For example, you can give a text to a model and get a summary of this text, or an answer to the question in the text, or a translation of the text.
Language models are often made of two parts - encoder and decoder. The encoder reads input text (each word is encoded as a bunch of numbers, for example, as list of 512 floating-point numbers) and produces a "state" (also a large list of numbers) which is expected to encode the meaning of the text. Then the decoder reads the state and produces the output as words (to be exact, as probabilities for every possible word in the dictionary to be at a certain position in the output).
Before Transformers, people tended to use so called "recurrent neural networks" architecture. With this approach, the encoder processes the text word by word and updates the state after every word:
    state = some initial state
    for word in text:
        state = model(state, word)

model(...) here is a complicated mathematical function, often with millions of operations and parameters. As I have written above, after reading the text, the state should encode the meaning of the text.
But it turned out that this approach doesn't scale well with long or complicated texts because the information from beginning of the text gets lost. The model tends to "forget" what it had read before. So a new architecture, "Transformers", was proposed. The difference is that now we give entire text (each word encoded as bunch of numbers) to the model:
    state = model(input text)
Now the model processes the text at once. But implementing this naively would result in a very large model with too many parameters that would require too much memory and computing time. So developers used a trick here - most of the time each input word is processed separately from the others (as in the recurrent model), but there are stages, called "attention", where the words are processed together (and those stages are relatively light), so it looks like this:

    # stage where all text is processed at once
    # using a quick algorithm
    state1 = attention(input text)

    # stage where each part of the state is processed independently
    # with lots of heavy calculations
    state2 = map(some function, state1)

    state3 = attention(state2)
    state4 = map(some function, state3)
    ...
To summarize, in Transformers the model processes the text at once, but we have to employ tricks and split processing into stages to make calculation feasible. Probably that is why some people believe the authors should receive a reward for their work. I think this explanation is as far as one can get without learning ML.
As you read each panel of a comic book, you don't just look at the words in the speech bubbles, but you also pay attention to who's talking, what they're doing, and what happened in the previous panels. You might pay more attention to some parts than others. This is sort of like what the Transformer model does with text.
When the Transformer reads a sentence, it doesn't just look at one word at a time. It looks at all the words at once, and figures out which ones are most important to understand each other. This is called "attention." For example, in the sentence "The cat, which is black, sat on the mat," the Transformer model would understand that "cat" is connected to "black" and "sat on the mat."
The "attention" part is very helpful because, like in a comic book, understanding one part of a sentence often depends on understanding other parts. This makes the Transformer model really good at understanding and generating language.
Also, because the Transformer pays attention to all parts of the sentence at the same time, it can be faster than other models that read one word at a time. This is like being able to read a whole page of your comic book at once, instead of having to read each panel one by one.
The important thing about the transformers model is that it's the first one we have found which keeps unlocking more and more powerful and general cognitive abilities the more resources we throw at it (parameters, exaflops, datasets). I saw some interview with Ilya Sutskever where he says this; it almost certainly won't be the last or best one, but it was the first one.
--
Why was it the first one? How were these guys so clever and other ones couldn't figure it out?
OK so first you need some context. There is a lot of 'Newton standing on the shoulders of giants' going on here. If all of these giants were around in the 1970s, it probably would have been invented then. Heck for all we know something as good was invented in the 1970s but our computers were too smol to benefit from it. This is what John Carmack is currently looking into.
To really notice the scaling benefits of the transformer architecture, they needed to run billion parameter transformer models on linear-algebra-accelerating GPU chips using differentiable programming frameworks. These are some of the giants we are standing on. The research and development pipeline for these amazing GPUs like [thousands of tech companies -> ASML -> TSMC -> NVIDIA] didn't exist until not so long ago. The special properties of transformers wouldn't have been discovered so soon without this hardware stack.
Another giant we are standing on is the differentiable programming linear algebra libraries and frameworks similar to theano or tensorflow or pytorch or jax. There have been things like this under the name 'mathematical programming', like CPLEX, but they weren't as accessible. 'Differentiable programming' is a newish term for what used to be called 'automatic differentiation', where 'differentiation' means essentially the same as the calculus derivative. Informally it means that these libraries can predict any tiny output effect of any tiny input change as a computationally cheap side-effect of computing the given output, even for complicated calculations. This capability makes optimization easier; in particular it generalizes the 'backpropagation' algorithm of traditional artificial neural networks.
--
What is the transformer model in more nerdy terms?
At one level, it's just a complicatedly parameterized function, where you can fit the parameters by training on data. This viewpoint puts the importance on the computational power applied to training the model with the advantage of differentiable programming. Some will probably guess that the details of the model architecture don't really matter as long as it has a sickening amount of parameters and exaflops and dataset. Some version of this viewpoint is probably true in my opinion.
More specifically, the transformer architecture is like a chain of black box differentiable 'soft' lookup tables. The soft queries and keys and values are each lists of floating point numbers (for example a single soft query is a list of numbers, called a vector) and these vectors are stacked into matrices and the soft lookup is processed quickly with fast matrix multiplication tricks. Importantly, all of this is happening inside of a differentiable programming framework which lets you cheaply answer questions about how any small change to the input will affect the output. This capability is used for training, by making trillions of billions of tiny changes to the floating point numbers in the multiplication matrices in the boxes. At the end, the fully trained chain of black box functions can be used to compute a probability distribution over the next token in the message, which lets you generate messages or translate between languages or whatever.
Think of a conversation you had with a friend. While they were talking, you were probably not just listening to the words they were saying right now, but also remembering what they said a few minutes ago. Your brain was connecting the dots between different parts of the conversation to understand the full meaning. Now, imagine if you could only understand each word in isolation and couldn't remember anything from a few seconds ago. Conversations would be pretty hard to understand, right?
In early NLP models, this was a big problem. They couldn't easily look at the "context" of a conversation or a sentence. They could only look at a few words at a time, so they were a bit like our forgetful person. They were good at understanding the meaning of individual words, but not so good at understanding how those words fit together to create meaning.
I guess the ELI5 (with a BUNCH of details left out) is “Transformers: what if you didn’t have to process sentences as a sequence of words, but rather as a picture of words.”