The paper is only 15 pages, and Figure 1 (the architecture diagram) is one of the most reproduced figures in machine learning. Read the abstract, skim Figure 1, then read Sections 3.2.1 (Scaled Dot-Product Attention) and 3.2.2 (Multi-Head Attention) with the equations in hand. Table 2 (translation results) shows how the model compares with earlier systems, and Table 3 (model variations, in effect an ablation study) shows what each design choice contributed.

Essential predecessor

Bahdanau, D., Cho, K., & Bengio, Y. (2014). Neural Machine Translation by Jointly Learning to Align and Translate. https://arxiv.org/abs/1409.0473

Paper 07 in this curriculum. The Transformer’s attention is the matrix-form generalisation of Bahdanau’s per-step attention. Reading both papers in sequence makes the evolution clear.

Luong, M.-T., Pham, H., & Manning, C. D. (2015). Effective Approaches to Attention-based Neural Machine Translation. https://arxiv.org/abs/1508.04025

The intermediate step: simplified dot-product attention. The Q·K formulation in the Transformer is directly inherited from Luong’s “dot” attention variant; the Transformer adds a 1/√d_k scaling factor.
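To make the link between the two predecessor papers and the Transformer concrete, here is a minimal sketch of scaled dot-product attention, softmax(QKᵀ/√d_k)V, in plain Python. The function names and the toy matrices are illustrative assumptions, not code from any of the papers. The loop over the rows of Q is what the Transformer paper writes as a single matrix product.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def scaled_dot_product_attention(Q, K, V):
    """Q is n_q x d_k, K is n_k x d_k, V is n_k x d_v (lists of lists)."""
    d_k = len(K[0])
    d_v = len(V[0])
    out = []
    for q in Q:
        # "Dot" score between this query and every key, scaled by sqrt(d_k).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in K]
        weights = softmax(scores)
        # Context vector: attention-weighted average of the value rows.
        out.append([sum(w * v[j] for w, v in zip(weights, V)) for j in range(d_v)])
    return out

# Toy inputs (made up for illustration): 2 queries, 3 key/value pairs.
Q = [[1.0, 0.0], [0.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
V = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
context = scaled_dot_product_attention(Q, K, V)
```

Each row of `context` is a mixture of the value rows, weighted towards the values whose keys best match that query. Bahdanau’s model computes one such row per decoder step; the Transformer computes all rows at once.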

Layer normalisation (background reading)

Ba, J. L., Kiros, J. R., & Hinton, G. E. (2016). Layer Normalization. https://arxiv.org/abs/1607.06450

The Layer Norm paper the Transformer relies on. Short (12 pages) and readable. Explains why Batch Norm fails for RNNs/sequences and how Layer Norm fixes it.
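As a rough illustration of the idea, the sketch below normalises each token’s feature vector using only that token’s own mean and variance. It is a simplified assumption-laden version: the learned gain and bias from the paper are omitted, and `layer_norm` and the toy values are made up for this example.

```python
import math

def layer_norm(x, eps=1e-5):
    """Normalise one feature vector to (approximately) zero mean, unit variance."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]

# Each row is one token. Statistics are computed per row, so the result
# for a token does not depend on the other tokens or on the batch size.
tokens = [[1.0, 2.0, 3.0, 4.0], [10.0, 10.0, 20.0, 40.0]]
normed = [layer_norm(t) for t in tokens]
```

Batch Norm would instead compute the mean and variance of each feature across the batch, which is awkward when sequences differ in length or batches are small.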

Immediate descendants

Devlin, J. et al. (2018). BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. https://arxiv.org/abs/1810.04805

Paper 11 in this curriculum. Encoder-only Transformer pre-trained on masked language modelling. Established pre-train + fine-tune as the dominant paradigm.

Radford, A. et al. (2018). Improving Language Understanding by Generative Pre-Training. (GPT-1) https://cdn.openai.com/research-covers/language-unsupervised/language_understanding_paper.pdf

Paper 10 in this curriculum. Decoder-only Transformer pre-trained on next-token prediction. The ancestor of GPT-4 and ChatGPT.

Accessible explanations

Jay Alammar: “The Illustrated Transformer” https://jalammar.github.io/illustrated-transformer/

The definitive visual guide to the Transformer. Animated figures walk through Q, K, V computation, multi-head attention, and the full encoder-decoder architecture. If Section 4 of this paper felt dense, read this first. It is one of the most-read ML blog posts ever written.

Jay Alammar: “Visualizing A Neural Machine Translation Model (Mechanics of Seq2seq Models With Attention)” https://jalammar.github.io/visualizing-neural-machine-translation-mechanics-of-seq2seq-models-with-attention/

Covers the plain seq2seq model first, then adds Bahdanau-style attention, the starting point for the Transformer. Useful for seeing the continuity between Papers 07 and 08.

Lilian Weng: “The Transformer Family” https://lilianweng.github.io/posts/2020-04-07-the-transformer-family/

A 2020 survey of the major Transformer variants, useful once you have understood the original. Shows how the limitations discussed in Section 7 were later addressed.

Video lectures

Andrej Karpathy: “Let’s build GPT: from scratch, in code, spelled out” https://www.youtube.com/watch?v=kCc8FmEb1nY

A 2-hour video where Karpathy codes a GPT-style (decoder-only) Transformer from scratch in PyTorch, explaining every line. One of the best educational resources in all of AI. Watch this after reading Paper 08.

Stanford CS224N Lecture 9: Self-Attention and Transformers

Available on YouTube: search “CS224N 2021 Lecture 9”.

Next in this curriculum

Paper 09: Mixture of Experts (Shazeer et al., 2017)

Replaces the Transformer’s dense FFN with conditional computation — only some “expert” FFNs activate for each token. Directly addresses the FFN underutilisation limitation noted in Section 7.
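A minimal sketch of what “only some experts activate” means, assuming plain top-k gating: keep the k largest gate scores, apply softmax over them, and skip every other expert. This leaves out the noise term and load-balancing losses used by Shazeer et al. The names `top_k_gate` and `moe_layer`, the toy experts, and the fixed gate logits are all hypothetical; in a real layer the logits come from a learned gating network applied to the token.

```python
import math

def top_k_gate(logits, k):
    """Keep the k largest logits, softmax over them, set the rest to 0."""
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    m = max(logits[i] for i in top)
    exps = {i: math.exp(logits[i] - m) for i in top}
    total = sum(exps.values())
    return [exps[i] / total if i in exps else 0.0 for i in range(len(logits))]

def moe_layer(x, experts, gate_logits, k=2):
    """Weighted sum of the outputs of the k selected experts for one token."""
    gates = top_k_gate(gate_logits, k)
    out = [0.0] * len(x)
    for g, expert in zip(gates, experts):
        if g == 0.0:
            continue  # unselected expert: its FFN is never evaluated
        y = expert(x)
        out = [o + g * yi for o, yi in zip(out, y)]
    return out

# Toy stand-ins for expert FFNs (illustrative only).
experts = [
    lambda x: [2.0 * v for v in x],
    lambda x: [v + 1.0 for v in x],
    lambda x: [-v for v in x],
    lambda x: [0.0 for _ in x],
]
gate_logits = [1.0, 3.0, 0.2, 2.0]  # assumed output of a gating network
gates = top_k_gate(gate_logits, k=2)
y = moe_layer([1.0, 2.0], experts, gate_logits, k=2)
```

With k=2 and four experts, half of the expert computation is skipped for this token, while the total parameter count still covers all four experts.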

Further Reading — Paper 08: Transformer (2017)

The original paper