Further Reading — Paper 08: Transformer (2017)
The original paper
Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, Ł., & Polosukhin, I. (2017). Attention Is All You Need. NeurIPS 2017. https://arxiv.org/abs/1706.03762
The paper is only 15 pages, and Figure 1 (the architecture diagram) is among the most widely reproduced figures in machine learning. Read the abstract, skim Figure 1, then read Sections 3.2.1 (Scaled Dot-Product Attention) and 3.2.2 (Multi-Head Attention) with the equations in hand. Table 2 reports the translation results, and Table 3 (model variations, effectively an ablation study) shows what each design choice contributed.
Essential predecessor
Bahdanau, D., Cho, K., & Bengio, Y. (2014). Neural Machine Translation by Jointly Learning to Align and Translate. https://arxiv.org/abs/1409.0473
Paper 07 in this curriculum. The Transformer’s attention is the matrix-form generalisation of Bahdanau’s per-step attention. Reading both papers in sequence makes the evolution clear.
Luong, M.-T., Pham, H., & Manning, C. D. (2015). Effective Approaches to Attention-based Neural Machine Translation. https://arxiv.org/abs/1508.04025
The intermediate step: simplified dot-product attention. The Q·K score in the Transformer is inherited from Luong’s “dot” attention variant, with one addition: the scores are divided by √d_k before the softmax.
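To make the link between the “dot” score and the Transformer's version concrete, here is a minimal sketch in plain Python. The function names (`softmax`, `dot`, `attention`) and the toy vectors are illustrative assumptions, not code from either paper. With `scale=False` the score is the bare dot product; with `scale=True` it is divided by the square root of the key dimension, as in the Transformer.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def attention(queries, keys, values, scale=True):
    """Dot-product attention over lists of vectors (illustrative sketch).

    scale=False uses the plain dot score; scale=True divides each score
    by sqrt(d_k), where d_k is the key dimension.
    """
    d_k = len(keys[0])
    factor = 1.0 / math.sqrt(d_k) if scale else 1.0
    outputs, all_weights = [], []
    for q in queries:
        # One score per key, then normalise the scores into weights.
        weights = softmax([dot(q, k) * factor for k in keys])
        # Output is the weighted sum of the value vectors.
        out = [sum(w * v[j] for w, v in zip(weights, values))
               for j in range(len(values[0]))]
        outputs.append(out)
        all_weights.append(weights)
    return outputs, all_weights

# Toy inputs (made up for illustration): 2 queries, 3 key/value pairs.
Q = [[1.0, 0.0], [0.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
V = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
out, w = attention(Q, K, V)
```

Each row of `w` sums to 1. Calling `attention(Q, K, V, scale=False)` on the same inputs gives sharper weights, which is the effect the scaling factor is meant to temper as d_k grows.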
Layer normalisation (background reading)
Ba, J. L., Kiros, J. R., & Hinton, G. E. (2016). Layer Normalization. https://arxiv.org/abs/1607.06450
The Layer Norm paper the Transformer relies on. Short (12 pages) and readable. Explains why Batch Norm fails for RNNs/sequences and how Layer Norm fixes it.
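The key property is easiest to see in code. The sketch below is an illustrative plain-Python version, not the paper's implementation; the function name `layer_norm`, the `eps` value, and the toy inputs are assumptions. Layer Norm computes its mean and variance across the features of a single token, so the result does not depend on the other tokens in the sequence or the other examples in the batch.

```python
import math

def layer_norm(x, gain=None, bias=None, eps=1e-5):
    """Normalise one feature vector to zero mean and unit variance,
    then apply an optional per-feature gain and bias."""
    n = len(x)
    mean = sum(x) / n
    var = sum((v - mean) ** 2 for v in x) / n
    normed = [(v - mean) / math.sqrt(var + eps) for v in x]
    gain = gain if gain is not None else [1.0] * n
    bias = bias if bias is not None else [0.0] * n
    return [g * v + b for g, v, b in zip(gain, normed, bias)]

# Two toy token vectors with very different scales (made up for illustration).
sequence = [[1.0, 2.0, 3.0, 4.0], [10.0, 20.0, 30.0, 40.0]]

# Each token is normalised on its own; no statistics are shared across tokens.
normed_sequence = [layer_norm(token) for token in sequence]
```

Both tokens come out almost identical after normalisation, and normalising a token alone gives the same result as normalising it as part of a longer sequence. Batch Norm, by contrast, takes its statistics across the batch, which is what makes it awkward for variable-length sequences.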
Immediate descendants
Devlin, J. et al. (2018). BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. https://arxiv.org/abs/1810.04805
Paper 11 in this curriculum. Encoder-only Transformer pre-trained on masked language modelling. Established pre-train + fine-tune as the dominant paradigm.
Radford, A. et al. (2018). Improving Language Understanding by Generative Pre-Training. (GPT-1) https://cdn.openai.com/research-covers/language-unsupervised/language_understanding_paper.pdf
Paper 10 in this curriculum. Decoder-only Transformer pre-trained on next-token prediction. The ancestor of GPT-4 and ChatGPT.
Accessible explanations
Jay Alammar: “The Illustrated Transformer” https://jalammar.github.io/illustrated-transformer/
The definitive visual guide to the Transformer. Animated figures walk through Q, K, V computation, multi-head attention, and the full encoder-decoder architecture. If Section 4 of this paper felt dense, read this first. It is one of the most-read ML blog posts ever written.
Jay Alammar: “Visualizing A Neural Machine Translation Model (Mechanics of Seq2seq Models With Attention)” https://jalammar.github.io/visualizing-neural-machine-translation-mechanics-of-seq2seq-models-with-attention/
Covers Bahdanau attention first, then builds toward the Transformer. Useful for seeing the continuity between Papers 07 and 08.
Lilian Weng: “The Transformer Family” https://lilianweng.github.io/posts/2020-04-07-the-transformer-family/
A 2020 survey of the major Transformer variants, useful once you have understood the original. Shows how each limitation we discussed in Section 7 was eventually addressed.
Video lectures
Andrej Karpathy: “Let’s build GPT: from scratch, in code, spelled out” https://www.youtube.com/watch?v=kCc8FmEb1nY
A 2-hour video where Karpathy codes a Transformer from scratch in PyTorch, explaining every line. One of the best educational resources in all of AI. Watch this after reading Paper 08.
Stanford CS224N Lecture 9: Self-Attention and Transformers
Available on YouTube; search “CS224N 2021 Lecture 9”.
Next in this curriculum
Paper 09 — Mixture of Experts (Shazeer et al., 2017) →
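Introduces conditional computation: a gating network activates only a few “expert” feed-forward networks for each token. The original paper places these expert layers between stacked LSTM layers; later work uses them in place of the Transformer’s dense FFN. Directly addresses the FFN underutilisation limitation noted in Section 7.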
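The idea of activating only some experts per token can be sketched as follows. This is a simplified illustration in the spirit of top-k gating, not the paper's method. It omits the noise term and load-balancing loss, takes the gate logits as given rather than computing them from the input with learned weights, and uses hypothetical names throughout (`top_k_gate`, `moe_layer`, `make_expert`).

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def top_k_gate(gate_logits, k=2):
    """Select the k largest logits and softmax over just those.

    Returns {expert_index: weight}; unselected experts are absent.
    """
    top = sorted(range(len(gate_logits)),
                 key=lambda i: gate_logits[i], reverse=True)[:k]
    weights = softmax([gate_logits[i] for i in top])
    return dict(zip(top, weights))

def moe_layer(x, experts, gate_logits, k=2):
    """Weighted sum of the outputs of the selected experts only."""
    gates = top_k_gate(gate_logits, k)
    out = [0.0] * len(x)
    for i, weight in gates.items():
        y = experts[i](x)  # only the selected experts are ever evaluated
        out = [o + weight * yi for o, yi in zip(out, y)]
    return out, gates

def make_expert(scale):
    """Toy stand-in for an expert FFN: multiplies the input by a constant."""
    return lambda x: [scale * v for v in x]

# Toy setup (made up for illustration): 4 experts, 2 active per token.
experts = [make_expert(s) for s in (1.0, 2.0, 3.0, 4.0)]
gate_logits = [0.1, 2.0, -1.0, 2.0]
out, gates = moe_layer([1.0, -1.0], experts, gate_logits, k=2)
```

Here only experts 1 and 3 run. The other two cost no compute for this token, which is the sense in which the computation is conditional.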
Math tutorials you will need for Papers 09–12:
- Cross-Entropy Loss → ❌ not yet built
- Normalisation → ✅ (just built)
- Probability Distributions → ✅