9. Summary — one page on the Transformer

The paper in one sentence

Remove all recurrence and replace it with self-attention, in which every position attends to every other position simultaneously, and you get a model that trains faster, scales further, and reaches higher translation quality than the recurrent models it replaced.

The problem it solved

RNN-based seq2seq models (with or without Bahdanau attention) were sequentially bottlenecked: hidden state t depends on t−1, so you cannot parallelise across the sequence. Training was slow, long-range dependencies were still hard, and scaling was painful.

The core idea

Self-attention: Every position computes three vectors (Query, Key, Value) from its own input. Each position then scores every position, itself included, via scaled dot products (Query · Key), applies softmax to turn the scores into attention weights, and outputs the weighted sum of all Value vectors. All positions do this simultaneously.
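The steps above can be sketched in a few lines of NumPy. This is a minimal single-head illustration, not the paper's reference code: the function names, the toy sizes, and the random weights are all assumptions made for the example, and it includes the 1/√dₖ scaling that appears in the key equation.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row max before exponentiating for numerical stability.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, W_q, W_k, W_v):
    """Single-head self-attention over T positions.

    X: (T, d_model) inputs; W_q, W_k: (d_model, d_k); W_v: (d_model, d_v).
    Returns the (T, d_v) outputs and the (T, T) attention weights.
    """
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)     # (T, T) pairwise query-key scores
    weights = softmax(scores, axis=-1)  # each row sums to 1
    return weights @ V, weights

# Toy sizes and random weights, chosen for illustration only.
rng = np.random.default_rng(0)
T, d_model, d_k, d_v = 5, 16, 8, 8
X = rng.normal(size=(T, d_model))
W_q = rng.normal(size=(d_model, d_k))
W_k = rng.normal(size=(d_model, d_k))
W_v = rng.normal(size=(d_model, d_v))

out, weights = self_attention(X, W_q, W_k, W_v)
print(out.shape, weights.shape)  # (5, 8) (5, 5)
```

Note that nothing in the function loops over positions: all T rows are handled by the same three matrix products.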

Multi-head attention: Run h = 8 independent attention operations in parallel, each in a lower-dimensional subspace, then concatenate the results and project them back to the model dimension. Different heads are free to specialise in different kinds of relationships.
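A rough sketch of the split, attend, concatenate, project pattern is shown below. It assumes d_model is divisible by h and packs all heads' projections into single (d_model × d_model) matrices, which is a common implementation convenience rather than something this summary specifies. The function name and toy sizes are illustrative.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(X, W_q, W_k, W_v, W_o, h):
    """X: (T, d_model); W_q, W_k, W_v, W_o: (d_model, d_model); h heads."""
    T, d_model = X.shape
    assert d_model % h == 0, "d_model must be divisible by the number of heads"
    d_k = d_model // h  # width of each head's subspace

    def split_heads(M):
        # (T, d_model) -> (h, T, d_k): one slice of columns per head
        return M.reshape(T, h, d_k).transpose(1, 0, 2)

    Q, K, V = split_heads(X @ W_q), split_heads(X @ W_k), split_heads(X @ W_v)
    scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d_k)       # (h, T, T)
    heads = softmax(scores, axis=-1) @ V                   # (h, T, d_k)
    concat = heads.transpose(1, 0, 2).reshape(T, d_model)  # (T, d_model)
    return concat @ W_o                                    # final projection

# Toy sizes for illustration; h = 8 matches the text above.
rng = np.random.default_rng(0)
T, d_model, h = 5, 64, 8
X = rng.normal(size=(T, d_model))
W_q, W_k, W_v, W_o = (rng.normal(size=(d_model, d_model)) * 0.1 for _ in range(4))

Y = multi_head_attention(X, W_q, W_k, W_v, W_o, h)
print(Y.shape)  # (5, 64)
```

Because each head works in a d_model / h subspace, the total cost stays close to that of a single full-width attention.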

No recurrence: The encoder processes all T positions in parallel instead of stepping through them one at a time, so training makes far better use of parallel hardware and finishes much faster.

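Positional encoding: Sinusoidal vectors added to the input embeddings give the model positional information without recurrence. Attention on its own is order-agnostic, so without them the model could not tell one ordering of the tokens from another.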
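For reference, here is a small sketch of the sinusoidal scheme using the formula from the original paper: even dimensions use sin(pos / 10000^(2i/d_model)) and odd dimensions use the matching cosine. The function name is illustrative and an even d_model is assumed.

```python
import numpy as np

def sinusoidal_positional_encoding(T, d_model):
    """Return a (T, d_model) matrix of position vectors (d_model assumed even)."""
    pos = np.arange(T)[:, None]                  # (T, 1) positions 0..T-1
    even_dims = np.arange(0, d_model, 2)[None, :]  # (1, d_model/2): the "2i" indices
    angles = pos / np.power(10000.0, even_dims / d_model)
    pe = np.zeros((T, d_model))
    pe[:, 0::2] = np.sin(angles)  # even dimensions
    pe[:, 1::2] = np.cos(angles)  # odd dimensions
    return pe

pe = sinusoidal_positional_encoding(T=50, d_model=16)
print(pe.shape)   # (50, 16)
print(pe[0, :4])  # [0. 1. 0. 1.]

# Usage: add to token embeddings of the same shape, e.g. X = embeddings + pe
```

Each dimension oscillates at a different wavelength, so every position gets a distinct pattern without any learned parameters.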

Add & Norm: Every sub-layer (attention, FFN) is wrapped in a residual connection and layer normalisation for stable training.

The analogy

Instead of students answering one at a time (RNN), the entire classroom compares notes simultaneously. Every student asks every other student a question (Query-Key scoring), decides how much to listen (softmax weights), and blends the answers (weighted Value sum). The class learns in parallel.

The key equation

Attention(Q, K, V) = softmax( Q · Kᵀ / √dₖ ) · V

Where:

Q = X · W^Q (query projections, shape T × dₖ)
K = X · W^K (key projections, shape T × dₖ)
V = X · W^V (value projections, shape T × dᵥ)
Q · Kᵀ (T × T matrix of all pairwise dot products)
/ √dₖ (scaling to prevent softmax saturation)
softmax(·) (row-wise, so each row is a probability distribution)
· V (weighted sum of values, shape T × dᵥ)
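To see why the √dₖ factor matters, a quick numerical check helps. The sketch below assumes query and key components are independent with zero mean and unit variance (an idealisation, not a property of trained models). Under that assumption a dot product over dₖ terms has variance of about dₖ, and dividing by √dₖ brings it back to about 1.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

rng = np.random.default_rng(0)
d_k = 64

# Assumption: query/key components are independent, zero-mean, unit-variance.
q = rng.normal(size=(10_000, d_k))
k = rng.normal(size=(10_000, d_k))

raw = (q * k).sum(axis=1)     # 10,000 sample dot products
scaled = raw / np.sqrt(d_k)
print(f"variance of raw dot products:    {raw.var():.1f}")     # close to d_k
print(f"variance of scaled dot products: {scaled.var():.2f}")  # close to 1

# Treat ten of the sampled scores as one row of Q · Kᵀ.
# Larger-magnitude scores push softmax toward a single dominant weight.
row = raw[:10]
peak_raw = softmax(row).max()
peak_scaled = softmax(row / np.sqrt(d_k)).max()
print(f"largest weight unscaled: {peak_raw:.3f}, scaled: {peak_scaled:.3f}")
```

A softmax that is nearly one-hot has very small gradients, which is the saturation the scaling is meant to prevent.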

Full encoder layer

A   = MultiHead(X, X, X)       ← self-attention (Q = K = V = X)
X'  = LayerNorm(X + A)         ← residual + layer norm
Z   = FFN(X')                  ← position-wise 2-layer MLP
X'' = LayerNorm(X' + Z)        ← residual + layer norm
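The four lines above translate fairly directly into NumPy. The sketch below makes several simplifying assumptions that are not part of the summary: a single attention head stands in for MultiHead, layer norm has no learned gain or bias, the FFN uses ReLU, and all names, sizes, and weights are illustrative.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def layer_norm(x, eps=1e-5):
    # Normalise each position's feature vector (learned gain/bias omitted).
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def attention(X, W_q, W_k, W_v):
    # Single-head stand-in for MultiHead(X, X, X).
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    return softmax(Q @ K.T / np.sqrt(Q.shape[-1])) @ V

def ffn(X, W1, b1, W2, b2):
    # Position-wise 2-layer MLP: the same weights are applied to every row.
    return np.maximum(0.0, X @ W1 + b1) @ W2 + b2

def encoder_layer(X, p):
    A = attention(X, p["W_q"], p["W_k"], p["W_v"])   # self-attention
    X1 = layer_norm(X + A)                           # residual + layer norm
    Z = ffn(X1, p["W1"], p["b1"], p["W2"], p["b2"])  # feed-forward
    return layer_norm(X1 + Z)                        # residual + layer norm

# Toy sizes and random parameters for illustration only.
rng = np.random.default_rng(0)
T, d_model, d_ff = 5, 16, 64
params = {
    "W_q": rng.normal(size=(d_model, d_model)) * 0.1,
    "W_k": rng.normal(size=(d_model, d_model)) * 0.1,
    "W_v": rng.normal(size=(d_model, d_model)) * 0.1,
    "W1": rng.normal(size=(d_model, d_ff)) * 0.1,
    "b1": np.zeros(d_ff),
    "W2": rng.normal(size=(d_ff, d_model)) * 0.1,
    "b2": np.zeros(d_model),
}
X = rng.normal(size=(T, d_model))
out = encoder_layer(X, params)
print(out.shape)  # (5, 16)
```

Input and output share the shape (T, d_model), which is what allows identical layers to be stacked on top of each other.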

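Stack 6 encoder layers. The decoder, also 6 layers deep, adds two things: a causal mask on its self-attention, so a position cannot attend to later positions, and a cross-attention sub-layer that attends to the encoder outputs.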
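The causal mask is easiest to understand by looking at the weights it produces. In this small sketch, disallowed scores are set to negative infinity before the softmax so they receive exactly zero weight. The helper names and toy sizes are assumptions for illustration.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def causal_mask(T):
    # True where attention is allowed: position i may see positions j <= i.
    return np.tril(np.ones((T, T), dtype=bool))

def masked_attention_weights(Q, K):
    T, d_k = Q.shape
    scores = Q @ K.T / np.sqrt(d_k)
    scores = np.where(causal_mask(T), scores, -np.inf)  # hide future positions
    return softmax(scores, axis=-1)

rng = np.random.default_rng(0)
T, d_k = 4, 8
Q = rng.normal(size=(T, d_k))
K = rng.normal(size=(T, d_k))

w = masked_attention_weights(Q, K)
print(np.round(w, 2))  # lower-triangular: everything above the diagonal is 0
```

The first row always puts weight 1.0 on position 0, since that is the only position it is allowed to see.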

What it unlocked

BERT (2018) · GPT-1/2/3 (2018–2020) · T5 · Vision Transformer · AlphaFold 2 · DALL-E · LLaMA · Claude · Gemini. Nearly every major AI system in use today builds on the Transformer.

What it left open

O(T²) quadratic attention cost limits context length
Positional encodings are absolute — relative position still awkward
Enormous data and compute required to realise the architecture’s potential
FFN layers underutilised (addressed by Mixture of Experts, Paper 09)
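To put the quadratic cost in the first item into numbers, here is a back-of-the-envelope calculation. It assumes 8 heads, 4-byte (float32) values, a single sequence, and an implementation that materialises the full T × T weight matrix for one layer. All of these are illustrative assumptions, and the function name is hypothetical.

```python
def attention_matrix_bytes(T, heads=8, bytes_per_value=4):
    """Memory for the T x T attention weights of one layer, for one sequence."""
    return heads * T * T * bytes_per_value

for T in (512, 2048, 8192):
    mib = attention_matrix_bytes(T) / 2**20
    print(f"T={T}: {mib:.0f} MiB")
# T=512: 8 MiB
# T=2048: 128 MiB
# T=8192: 2048 MiB
```

Each 4× increase in sequence length costs 16× the memory, which is why long contexts become expensive so quickly.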

Difficulty

🔴 The math (Sections 4–5) is advanced-undergraduate level: matrix products, transposes, softmax, layer norm. 🟡 The concept and code (Sections 3, 6) are first-year college level. 🟢 Sections 1–2 and 8–9 are accessible to anyone.
