At both places it was obvious to me that word vectors were a dead end: frequently the most important words in a document are not in the dictionary, and my doctrine at the time was that in a system built from a number of stages, you can't really recover later from a mistake made in an early stage.
So we were very impressed with byte pair encoding as used in BERT: it might give less-than-optimal results when a word isn't in the dictionary, but no information has been lost.
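To make that concrete, here is a toy sketch of the BPE idea (BERT's actual tokenizer is WordPiece, a close cousin of BPE; the tiny corpus, the merge count, and helper names like `learn_bpe` are made up for illustration): a word that never appeared in training still decomposes into subword pieces the model knows, so nothing is thrown away.

```python
from collections import Counter

def learn_bpe(words, num_merges):
    """Learn BPE merges from a word-frequency dict; returns the ordered merge list."""
    # Represent each word as a tuple of symbols, ending with an end-of-word marker.
    vocab = Counter({tuple(w) + ("</w>",): c for w, c in words.items()})
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by word frequency.
        pairs = Counter()
        for symbols, count in vocab.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += count
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        # Replace every occurrence of the best pair with the merged symbol.
        new_vocab = Counter()
        for symbols, count in vocab.items():
            merged, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            new_vocab[tuple(merged)] += count
        vocab = new_vocab
    return merges

def segment(word, merges):
    """Apply the learned merges to a new word, in order."""
    symbols = list(word) + ["</w>"]
    for a, b in merges:
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and symbols[i] == a and symbols[i + 1] == b:
                out.append(a + b)
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        symbols = out
    return symbols

corpus = {"low": 5, "lower": 2, "newest": 6, "widest": 3}
merges = learn_bpe(corpus, num_merges=10)
# "lowest" never appears in the corpus, but it still maps onto known subwords.
print(segment("lowest", merges))
```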
We were also impressed with token activations (compared to word vectors), because a token activation doesn't just represent a word, it represents the meaning of the word in context. It wasn't clear to me that word vectors were really going to help: a word vector models relationships between similar words, but it has no way of modeling a single word having different meanings.
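Here is a minimal sketch of what I mean, assuming the HuggingFace `transformers` library and the public `bert-base-uncased` checkpoint (the sentences and the `token_activation` helper are just illustrative): the same surface word gets a different activation in each context, whereas a word vector would assign it one fixed point.

```python
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")
model.eval()

def token_activation(sentence, word):
    """Return the final-layer hidden state for `word` inside `sentence`."""
    inputs = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state[0]  # (seq_len, hidden_dim)
    idx = inputs.input_ids[0].tolist().index(tokenizer.convert_tokens_to_ids(word))
    return hidden[idx]

river = token_activation("She sat on the bank of the river.", "bank")
money = token_activation("She deposited the check at the bank.", "bank")
money2 = token_activation("The bank approved the loan.", "bank")

cos = torch.nn.functional.cosine_similarity
# Expectation (hedged): the two financial senses of "bank" should sit closer
# to each other than either does to the river sense.
print("river vs money :", cos(river, money, dim=0).item())
print("money vs money2:", cos(money, money2, dim=0).item())
```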
With LSTMs there was controversy about how to represent coherence in documents. For instance, if you are writing a story about a person named such-and-such, you need to store that name and copy it over and over again, or represent it in different ways (first name, first name + last name, pronouns, …). LSTMs and RNNs did not have a good story for that, and people didn't like the obvious options. Transformers answer that question quite well.
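As a structural illustration, here is a toy scaled dot-product attention in plain NumPy (the vectors are random stand-ins and the sentence in the comment is hypothetical): any position can look straight back at the position where a name was introduced, however far back it is, instead of relying on a recurrent state to have carried it forward intact.

```python
import numpy as np

def attention(Q, K, V):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d)) V."""
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V, weights

rng = np.random.default_rng(0)
seq_len, d = 8, 16                      # e.g. "Alice went to the store . Then she ..."
X = rng.normal(size=(seq_len, d))       # random stand-ins for token representations

# Random projections stand in for the learned W_Q, W_K, W_V matrices.
W_q, W_k, W_v = (rng.normal(size=(d, d)) for _ in range(3))
out, weights = attention(X @ W_q, X @ W_k, X @ W_v)

# Row 7 of `weights` shows how much the last token attends to every earlier
# position, including the one holding the name, in a single step.
print(weights[7].round(2))
```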
I had no idea though that transformers were going to go as far as they did.
The ML community took to it very quickly: multi-headed attention was being used in everything soon afterwards, and LSTMs fell out of use. But it took a few months. HuggingFace really kicked a lot of this off, then BERT, then DistilBERT.
It really wasn't until the original BERT paper came out and topped the GLUE leaderboard that the move away from LSTM/CNN-based architectures started. I remember feeling at the time that a ~300M parameter model was absolutely huge.