https://en.wikipedia.org/wiki/Attention_Is_All_You_Need
They adapted a technique developed for machine translation ("attention"), which had already been advancing rapidly over the previous decade or so.
"Attention" requires really big matrices, and they threw truly vast amounts of data at it. People had been developing techniques for managing that sheer amount of computation, including dedicated hardware and GPUs.
It's still remarkable that it got so good. It's as if some emergent phenomenon appeared only once enough data was approached the right way. So it's not at all clear whether further big improvements will require another fundamental discovery, or whether it's just a matter of evolution from here.
This is also what limits them in other ways.