8. Impact — the architecture that ate AI
There is no polite way to say this: the Transformer paper did not merely advance the field. It restructured it. Almost every major AI system built after 2018 uses the Transformer architecture or a direct descendant of it. Understanding this paper is not optional background knowledge for understanding modern AI. It is the foundation.
Immediate results: state-of-the-art translation
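On WMT 2014 English-to-German, the large Transformer reached 28.4 BLEU, more than 2 BLEU above the best previously reported results, including ensembles (combinations of multiple models). On English-to-French, it reached 41.8 BLEU in the final version of the paper (an earlier arXiv version reported 41.0), a new single-model record.
More striking was the training cost. The base model trained in about 12 hours on 8 GPUs, and the large model behind the record scores took 3.5 days on the same hardware, a fraction of the training cost the paper reports for the best earlier models. The speed-up from parallelism was as significant as the accuracy improvement.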
BERT (2018, Paper 11): the encoder alone, revolutionising NLP
Google’s BERT (Bidirectional Encoder Representations from Transformers) used only the Transformer encoder, trained on a massive text corpus with two unsupervised objectives: predicting masked-out words and predicting whether two sentences were adjacent.
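The masked-word objective can be illustrated with a minimal sketch. The 15% selection rate and the 80/10/10 replacement split follow the BERT paper; the function name `mask_tokens`, the toy vocabulary, and the per-token coin flip (instead of selecting an exact fraction of positions) are simplifying assumptions made for illustration, not BERT's actual preprocessing code.

```python
import random

MASK_TOKEN = "[MASK]"

def mask_tokens(tokens, vocab, mask_prob=0.15, seed=0):
    """Return (corrupted_tokens, labels) for masked-word prediction.

    labels[i] holds the original token at selected positions, None elsewhere.
    """
    rng = random.Random(seed)
    corrupted, labels = [], []
    for token in tokens:
        if rng.random() < mask_prob:
            labels.append(token)                      # the model must recover this token
            roll = rng.random()
            if roll < 0.8:
                corrupted.append(MASK_TOKEN)          # 80%: replace with [MASK]
            elif roll < 0.9:
                corrupted.append(rng.choice(vocab))   # 10%: replace with a random token
            else:
                corrupted.append(token)               # 10%: leave unchanged
        else:
            labels.append(None)                       # not a prediction target
            corrupted.append(token)
    return corrupted, labels

# Toy example; mask_prob is raised so that a short sentence shows some masking.
sentence = "the cat sat on the mat".split()
vocab = sorted(set(sentence))
corrupted, labels = mask_tokens(sentence, vocab, mask_prob=0.5, seed=1)
print(corrupted)
print(labels)
```

The encoder sees the corrupted sequence and is trained to predict the original token at every position where the label is not `None`. Because it can read context on both sides of a masked position, the learned representations are bidirectional.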
BERT set new records on 11 NLP benchmarks simultaneously. The key insight: pre-training a deep Transformer encoder on general text produces representations so rich that fine-tuning on a small task-specific dataset produces strong results.
BERT demonstrated that the Transformer was not just for translation — it was a general-purpose language understanding engine.
GPT (2018–2020, Papers 10, 12): the decoder alone, scaling to language generation
OpenAI’s GPT series used only the Transformer decoder, trained to predict the next token autoregressively. GPT-1 (2018) was a proof of concept. GPT-2 (2019) was capable enough that OpenAI initially withheld the full model, citing concerns about misuse, and released it in stages. GPT-3 (2020, Paper 12) scaled to 175 billion parameters and demonstrated in-context learning: the ability to perform tasks from just a few examples in the prompt.
GPT-3 did more than set new benchmark results. It surprised researchers with qualitative capabilities: coherent long-form writing, code generation, and translation without being explicitly trained to translate. The Transformer had been scaled far enough that new capabilities emerged.
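What makes a decoder autoregressive is a causal mask: position i may attend only to positions up to and including i, so each prediction depends only on earlier tokens. The sketch below applies that mask to a small matrix of raw attention scores. The function name `causal_attention_weights` and the score values are hypothetical, and real implementations work on batched tensors instead of nested lists.

```python
import math

def causal_attention_weights(scores):
    """Row-wise softmax over scores[i][j], allowing only positions j <= i."""
    weights = []
    for i, row in enumerate(scores):
        visible = row[: i + 1]                 # tokens at or before position i
        peak = max(visible)                    # subtract the max for numerical stability
        exps = [math.exp(s - peak) for s in visible]
        total = sum(exps)
        # Future positions receive exactly zero weight.
        weights.append([e / total for e in exps] + [0.0] * (len(row) - i - 1))
    return weights

# Made-up raw scores for a three-token sequence.
scores = [[0.0, 1.0, 2.0],
          [1.0, 0.0, 3.0],
          [2.0, 1.0, 0.0]]
for row in causal_attention_weights(scores):
    print([round(w, 3) for w in row])
```

The first token can attend only to itself, and every row is zero above the diagonal. This is what allows the model to be trained on all positions of a sequence in parallel while still generating one token at a time.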
The scaling law: bigger is better (Paper 13)
Kaplan et al. (2020) showed that Transformer performance follows predictable power-law relationships with three quantities: compute, data, and model size. This meant you could predict how good a model would be before training it — if you knew your compute budget, you could calculate the optimal model size and dataset size.
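The idea of predicting performance before training can be sketched as a power-law fit. The example assumes the simple form loss = a · size^(-alpha), fits it to a handful of small runs in log-log space, and extrapolates to a larger model. All numbers here are invented for illustration and are not the constants reported by Kaplan et al.; the function names are likewise hypothetical.

```python
import math

def fit_power_law(sizes, losses):
    """Least-squares fit of loss = a * size**(-alpha) in log-log space."""
    xs = [math.log(n) for n in sizes]
    ys = [math.log(l) for l in losses]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    intercept = mean_y - slope * mean_x
    return math.exp(intercept), -slope         # (a, alpha)

def predict_loss(a, alpha, size):
    """Extrapolate the fitted law to a new model size."""
    return a * size ** (-alpha)

# Hypothetical small-scale runs generated from a made-up law (assumption).
true_a, true_alpha = 12.0, 0.08
sizes = [1e6, 1e7, 1e8, 1e9]
losses = [true_a * n ** (-true_alpha) for n in sizes]

a, alpha = fit_power_law(sizes, losses)
print(f"fitted a={a:.3f}, alpha={alpha:.3f}")
print(f"predicted loss at 1e11 parameters: {predict_loss(a, alpha, 1e11):.3f}")
```

A power law is a straight line on log-log axes, which is why a simple linear regression on the logarithms is enough here. In practice the measured losses are noisy and the fit holds only over the range where the law has been validated.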
This moved AI from an art toward a form of engineering. It also gave labs a clear incentive to scale the Transformer architecture as large as their budgets allowed. Nearly every major language model since has been a scaled Transformer.
Beyond text: vision, protein, code
The Transformer’s power turned out not to be specific to language. In 2020, the Vision Transformer (ViT) showed that applying the Transformer directly to image patches, treating each patch as a token, could match or beat state-of-the-art convolutional networks on image classification when pre-trained on enough data.
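Treating each patch as a token comes down to cutting the image into a grid and flattening every cell into a vector. The sketch below does this for a toy single-channel image stored as a list of rows. The function name `image_to_patch_tokens` is hypothetical, and the learned linear projection and position embeddings that a real ViT applies to each flattened patch are omitted.

```python
def image_to_patch_tokens(image, patch_size):
    """Split a 2D image (list of rows) into flattened, non-overlapping patches."""
    height, width = len(image), len(image[0])
    if height % patch_size or width % patch_size:
        raise ValueError("image dimensions must be divisible by patch_size")
    tokens = []
    for top in range(0, height, patch_size):
        for left in range(0, width, patch_size):
            # Flatten one patch in row-major order.
            patch = [image[r][c]
                     for r in range(top, top + patch_size)
                     for c in range(left, left + patch_size)]
            tokens.append(patch)
    return tokens

# A toy 4x4 "image" with pixel values 0..15.
image = [[r * 4 + c for c in range(4)] for r in range(4)]
tokens = image_to_patch_tokens(image, patch_size=2)
print(len(tokens))   # number of patch tokens
print(tokens[0])     # flattened top-left patch
```

The resulting list of vectors plays the same role as a sequence of word embeddings, so the rest of the Transformer can be left unchanged.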
In 2021, AlphaFold 2 used Transformer-based attention over sequences of amino acids to predict 3D protein structures with near-experimental accuracy — one of the most celebrated scientific results of the decade.
Code (Codex, 2021), music, images (DALL-E), and video followed. The architecture proved remarkably general.
The models you use today
If you have used ChatGPT, Claude, Gemini, Copilot, or any major AI assistant, you have been talking to a Transformer. The specific models differ in size, training data, safety fine-tuning, and architectural tweaks, but the core operation is the same: scaled dot-product attention, softmax(Q·Kᵀ/√dₖ)·V, computed in every attention layer.
- Claude (Anthropic) — Transformer with constitutional AI training
- GPT-4 (OpenAI) — Transformer, reported to be a mixture-of-experts variant
- Gemini (Google DeepMind) — Transformer with multimodal extensions
- LLaMA (Meta, Paper 17) — Transformer with rotary positional embeddings, with grouped-query attention added in later versions
- Mistral (Paper 18) — Transformer with sliding-window attention for efficiency
All of these trace their architecture to the eight-page paper submitted to arXiv in June 2017 with the simple title “Attention Is All You Need.”
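The attention operation shared by all of these models, softmax(Q·Kᵀ/√dₖ)·V, is short enough to write out in full. The version below is a minimal single-head sketch in plain Python, with no batching, masking, or learned projections. The function names and the toy inputs are illustrative assumptions, and production systems use optimised tensor libraries instead.

```python
import math

def softmax(values):
    """Numerically stable softmax over a list of floats."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

def scaled_dot_product_attention(queries, keys, values):
    """Compute softmax(Q K^T / sqrt(d_k)) V for lists of row vectors."""
    d_k = len(keys[0])
    outputs = []
    for q in queries:
        # Similarity of this query to every key, scaled by sqrt(d_k).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k)
                  for k in keys]
        weights = softmax(scores)
        # Weighted average of the value vectors.
        outputs.append([sum(w * v[j] for w, v in zip(weights, values))
                        for j in range(len(values[0]))])
    return outputs

# Toy inputs: two queries, two keys, two values, all of dimension 2.
Q = [[1.0, 0.0], [0.0, 0.0]]
K = [[1.0, 0.0], [0.0, 1.0]]
V = [[1.0, 2.0], [3.0, 4.0]]
for row in scaled_dot_product_attention(Q, K, V):
    print([round(x, 3) for x in row])
```

The second query is all zeros, so it scores every key equally and its output is the plain average of the value vectors. The first query matches the first key more closely, so its output is pulled toward the first value vector.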
By the numbers
- Citations as of 2025: over 100,000
- Number of major AI products directly using the architecture: essentially all of them
- The phrase “attention is all you need” has become a cultural touchstone in AI — it appears on T-shirts, conference talks, and countless blog posts