9. Summary — one page on the Transformer
The paper in one sentence
Remove all recurrence. Replace it with self-attention — every position attending to every other simultaneously — and you get a model that trains faster, scales further, and understands language better.
The problem it solved
RNN-based seq2seq models (with or without Bahdanau attention) were sequentially bottlenecked: hidden state t depends on t−1, so you cannot parallelise across the sequence. Training was slow, long-range dependencies were still hard, and scaling was painful.
The core idea
Self-attention: Every position computes three vectors — Query, Key, Value — from its own input. Each position then scores every other position via dot products (Query · Key), applies softmax to get attention weights, and outputs a weighted sum of all Value vectors. All positions do this simultaneously.
Multi-head attention: Run h = 8 independent attention operations in parallel, each in a lower-dimensional subspace, then concatenate and project. Each head learns different relationship types.
No recurrence: The entire encoder processes all T positions in parallel. Training time drops dramatically.
Positional encoding: Sinusoidal vectors injected into input embeddings give the model positional information without recurrence.
Add & Norm: Every sub-layer (attention, FFN) is wrapped in a residual connection and layer normalisation for stable training.
The analogy
Instead of students answering one at a time (RNN), the entire classroom compares notes simultaneously. Every student asks every other student a question (Query-Key scoring), decides how much to listen (softmax weights), and blends the answers (weighted Value sum). The class learns in parallel.
The key equation
Attention(Q, K, V) = softmax( Q · Kᵀ / √dₖ ) · V
Where:
- Q = X · W^Q (query projections, shape T × dₖ)
- K = X · W^K (key projections, shape T × dₖ)
- V = X · W^V (value projections, shape T × dᵥ)
- Q · Kᵀ (T × T matrix of all pairwise dot products)
- / √dₖ (scaling to prevent softmax saturation)
- softmax(·) (row-wise, so each row is a probability distribution)
- · V (weighted sum of values, shape T × dᵥ)
Full encoder layer
A = MultiHead(X, X, X) ← self-attention (Q = K = V = X)
X' = LayerNorm(X + A) ← residual + layer norm
Z = FFN(X') ← position-wise 2-layer MLP
X'' = LayerNorm(X' + Z) ← residual + layer norm
Stack 6 encoder layers. Decoder adds a causal mask on self-attention and cross-attention to encoder outputs.
What it unlocked
BERT (2018) · GPT-1/2/3 (2018–2020) · T5 · Vision Transformer · AlphaFold 2 · DALL-E · LLaMA · Claude · Gemini. Every major AI system in use today.
What it left open
- O(T²) quadratic attention cost limits context length
- Positional encodings are absolute — relative position still awkward
- Enormous data and compute required to realise the architecture’s potential
- FFN layers underutilised (addressed by Mixture of Experts, Paper 09)
Difficulty
🔴 The math (Section 4–5) is advanced undergrad — matrix products, transpose, softmax, layer norm. 🟡 The concept and code (Sections 3, 6) are first-year college. 🟢 Sections 1–2, 8–9 are accessible to anyone.
Next paper: Paper 09 — Mixture of Experts → Back to: Paper 07 — Attention / Bahdanau →