HACKER Q&A
📣 TheCaptain4815

How did the AI community react to the Transformer Model back in 2017?


As the title states, when the transformer model was first introduced, what were the reactions? Did AI work shift overnight? Obviously it took about 4-5 years for production stuff to really make a difference (GPT-3/ChatGPT), but how did the AI community respond to its release?


  👤 PaulHoule Accepted Answer ✓
I was working for a company that had developed a kind of “foundation model” based on character CNNs when BERT came out. Previously I had worked on another project trying to build one for clinical notes based on LSTM.

At both places it was obvious to me that word vectors were a dead end, because frequently the most important words in a document are not in the dictionary. My doctrine at the time was that a system like that runs in a number of stages, and you can't really recover later from a mistake made in an early stage.

So we were very impressed with byte pair encoding as used in BERT: while it might give less-than-optimal results if a word isn't in the dictionary, no information has been lost.
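A toy sketch of the idea (my own illustration, not BERT's actual tokenizer, and the vocabulary here is made up): greedy longest-match subword splitting never throws a word away, it just breaks an out-of-vocabulary word into known pieces, falling back to single characters if it has to.

    def subword_split(word, vocab):
        """Greedily split `word` into the longest pieces found in `vocab`.

        Single characters are always allowed as a fallback, so the original
        string can always be reconstructed -- nothing collapses into a lossy
        unknown-word token the way a fixed word-vector dictionary forces it to.
        """
        pieces, i = [], 0
        while i < len(word):
            for j in range(len(word), i, -1):        # try the longest match first
                if word[i:j] in vocab or j - i == 1:
                    pieces.append(word[i:j])
                    i = j
                    break
        return pieces

    vocab = {"trans", "former", "hyper", "gly", "cem", "ia"}
    print(subword_split("transformer", vocab))    # ['trans', 'former']
    print(subword_split("hyperglycemia", vocab))  # ['hyper', 'gly', 'cem', 'ia']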

We were also impressed with token activations (compared to word vectors), because a token activation doesn't just represent a word, it represents the meaning of the word in context. It wasn't clear to me that word vectors were really going to help, since a word vector models relationships between similar words but has no way of modeling a single word having different meanings.
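A quick way to see that today, assuming the HuggingFace transformers library and the bert-base-uncased checkpoint (my choice of example, not the setup from those projects): the activation for "bank" comes out different depending on the sentence it sits in, which a single static word vector cannot express.

    import torch
    from transformers import AutoModel, AutoTokenizer

    tok = AutoTokenizer.from_pretrained("bert-base-uncased")
    model = AutoModel.from_pretrained("bert-base-uncased").eval()

    def bank_vector(sentence):
        # Contextual activation for the token "bank" in this sentence.
        inputs = tok(sentence, return_tensors="pt")
        with torch.no_grad():
            hidden = model(**inputs).last_hidden_state[0]   # (seq_len, 768)
        position = inputs["input_ids"][0].tolist().index(tok.convert_tokens_to_ids("bank"))
        return hidden[position]

    v1 = bank_vector("I deposited the check at the bank.")
    v2 = bank_vector("We sat on the grassy bank of the river.")
    # Same surface word, noticeably different activations:
    print(torch.cosine_similarity(v1, v2, dim=0))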

With the LSTMs there was controversy about how to represent coherence in documents. For instance, if you are writing a story about a person named such-and-such, you need to store that name and copy it over and over again, or represent it in different ways (first name, first name + last name, pronouns ...), and LSTMs and RNNs did not have a good story for that; people didn't like the obvious options. Transformers answer that question quite well.
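For what it's worth, here is a minimal sketch of the mechanism that answers it, plain scaled dot-product attention in PyTorch on random toy data (nothing from a real model): any later position can look straight back at the position holding the name, instead of hoping an RNN hidden state carried it across every intermediate step.

    import torch
    import torch.nn.functional as F

    def scaled_dot_product_attention(q, k, v):
        # softmax(Q K^T / sqrt(d)) V -- each row of the weights says how much
        # that position reads from every other position.
        d = q.size(-1)
        scores = q @ k.transpose(-2, -1) / d ** 0.5
        weights = F.softmax(scores, dim=-1)
        return weights @ v, weights

    # Pretend position 0 holds the name "Alice" and position 5 is a pronoun
    # that needs it: row 5 of the weights is a direct connection back to 0.
    torch.manual_seed(0)
    x = torch.randn(6, 8)                      # 6 toy tokens, dimension 8
    _, weights = scaled_dot_product_attention(x, x, x)
    print(weights[5])                          # last token's attention over all positions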

I had no idea though that transformers were going to go as far as they did.


👤 singhrac
I mean, it was clearly a better language model than the best model at the time (which, if I recall correctly, was Yang et al.'s mixture-of-softmaxes). But the biggest surprise over time has been what a sponge large transformer models are (absorbing huge amounts of data), which was definitely not evident until later. I was also personally surprised at how long you need to train these models (in terms of epochs), but I think some of the work leading to that (around optimal batch size, etc.) was already being done. I think that actually changed quite a lot over the next few years, from really long training (many epochs over the same data) to really "short" training (1-2 epochs on a much larger dataset).

The ML community took to it very quickly; multi-headed attention was being used in everything very soon afterwards, and LSTMs were no longer used. But it took a few months. HuggingFace really kicked a lot of this off, then BERT, then DistilBERT.


👤 bradfox2
On my team, there initially wasn't acceptance from folks who were used to traditional NLP techniques. The models built with transformers were viewed as over-parameterized.

It really wasn't until the original BERT paper came out and topped the GLUE leaderboard that the move away from LSTM- and CNN-based architectures started. I remember feeling at the time that a ~300M-parameter model was absolutely huge.