Paper 08 — Attention Is All You Need

Vaswani, Shazeer, Parmar, Uszkoreit, Jones, Gomez, Kaiser, Polosukhin · NeurIPS 2017 · arXiv:1706.03762

What this paper did

It replaced recurrence entirely.

Bahdanau’s attention (Paper 07) improved the decoder’s memory by letting it look back at all encoder states. But both encoder and decoder were still recurrent networks: sequential by design, because step t cannot start until step t−1 has finished, so the computation cannot be parallelised across the sequence.

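The Transformer removed the recurrent layers completely. Instead, every layer is built purely from attention operations and position-wise feed-forward networks, both of which process all positions in parallel during training. Training became far cheaper: the paper reports about 12 hours on 8 GPUs for the base model and 3.5 days for the big model. The number of sequential steps separating any two positions dropped from O(T) in a recurrent network to O(1), which makes long-range dependencies much easier to learn.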

The core equation:

Attention(Q, K, V) = softmax( Q · Kᵀ / √dₖ ) · V

Every position computes a Query vector (what am I looking for?), a Key vector (what do I offer?), and a Value vector (what do I send when selected?). All pairwise Query-Key scores are computed at once via matrix multiplication, divided by √dₖ so that large dot products do not saturate the softmax, normalised via softmax, and used to blend Value vectors. Eight of these attention “heads” run in parallel per layer, each with its own learned projections of Q, K, and V.
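The formula can be traced by hand in a few lines of code. What follows is a minimal sketch of one attention head in plain Python, not the paper's implementation: the function names `softmax` and `attention` and the toy numbers are assumptions made for illustration, and the learned projection matrices that produce Q, K, and V are left out.

```python
import math

def softmax(row):
    """Numerically stable softmax over one list of scores."""
    peak = max(row)
    exps = [math.exp(s - peak) for s in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention(Q, K, V):
    """Scaled dot-product attention for a single head.

    Q, K: lists of d_k-dimensional vectors. V: list of d_v-dimensional vectors.
    Returns (outputs, weights), one row per query position.
    """
    d_k = len(K[0])
    scale = math.sqrt(d_k)
    # Score every query against every key: Q · Kᵀ / √d_k.
    scores = [
        [sum(q_i * k_i for q_i, k_i in zip(q, k)) / scale for k in K]
        for q in Q
    ]
    # Each row of weights is non-negative and sums to 1.
    weights = [softmax(row) for row in scores]
    # Blend the value vectors with those weights.
    d_v = len(V[0])
    outputs = [
        [sum(w * v[j] for w, v in zip(row, V)) for j in range(d_v)]
        for row in weights
    ]
    return outputs, weights

# Toy input (made-up numbers): 3 positions, d_k = d_v = 2.
Q = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
V = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
outputs, weights = attention(Q, K, V)
```

Each row of `weights` shows how strongly one position attends to every other position, and each row of `outputs` is the corresponding weighted average of the Value vectors. A real layer would use matrix operations instead of nested loops; the loops here only make each step of the formula visible.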

Stack 6 encoder and 6 decoder layers of this, add positional encodings, layer normalisation, and residual connections, and you have the Transformer.
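Because attention treats its input as an unordered set, the positional encodings are what tell the model where each token sits. The paper uses fixed sine and cosine waves of different frequencies: PE(pos, 2i) = sin(pos / 10000^(2i/d_model)) and PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model)). A minimal sketch, assuming a hypothetical helper named `positional_encoding` and toy sizes far smaller than the paper's d_model = 512:

```python
import math

def positional_encoding(num_positions, d_model):
    """Sinusoidal encodings: even dimensions use sin, odd dimensions use cos."""
    table = []
    for pos in range(num_positions):
        row = []
        for i in range(d_model):
            # Dimensions 2k and 2k+1 share the frequency 1 / 10000^(2k / d_model).
            angle = pos / (10000 ** ((2 * (i // 2)) / d_model))
            row.append(math.sin(angle) if i % 2 == 0 else math.cos(angle))
        table.append(row)
    return table

# Toy sizes for illustration only.
pe = positional_encoding(num_positions=4, d_model=8)
```

Row `pe[pos]` is added element-wise to the embedding of the token at position `pos` before the first layer.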

The Indian analogy

Instead of students answering one at a time (the RNN way), the whole classroom compares notes simultaneously. Every student sends a question (Query) to every other student, checks it against what each student advertises (Key), decides how much to trust each one (softmax weights), and blends what they send back (weighted Values). The classroom learns in parallel: one round of this gives everyone full context.

Read in this order

Section	What you will learn	Difficulty	Time
1. Context	The RNN wall in 2017	🟢	4 min
2. The Problem	Sequential bottleneck and long-range limits	🟢	3 min
3. The Idea	Self-attention, Q/K/V, multi-head, positional encoding	🟡	5 min
4. The Math	Full attention formula with numerical worked example	🔴	12 min
5. Worked Example	One full encoder layer on “The chai is hot”	🔴	12 min
6. The Code	Scaled dot-product attention in NumPy	🟡	6 min
7. Limitations	Quadratic cost, positional encoding, compute requirements	🟡	4 min
8. Impact	BERT, GPT, AlphaFold, every AI system today	🟢	4 min
9. Summary	One-page recap	🟢	3 min

Also: Glossary · Quiz · Further Reading

Before you read: math tutorials you need

Matrix Transpose: Q·Kᵀ requires understanding transpose
Matrix Multiplication: used throughout
Softmax Function: converts scores to attention weights
Normalisation: layer norm after every sub-layer
Dot Product: foundation of Q·K scoring

The full architecture at a glance

INPUT TOKENS
    ↓
Embedding + Positional Encoding
    ↓
┌─────────────── Encoder × 6 ───────────────┐
│  Multi-Head Self-Attention                 │
│  Add & Norm                                │
│  Feed-Forward Network                      │
│  Add & Norm                                │
└────────────────────────────────────────────┘
    ↓ (encoder output to decoder cross-attention)
┌─────────────── Decoder × 6 ───────────────┐
│  Masked Multi-Head Self-Attention          │
│  Add & Norm                                │
│  Multi-Head Cross-Attention (←encoder)     │
│  Add & Norm                                │
│  Feed-Forward Network                      │
│  Add & Norm                                │
└────────────────────────────────────────────┘
    ↓
Linear + Softmax → Output Probabilities
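Every "Add & Norm" box in the diagram applies the same pattern: the sub-layer's output is added back to its input (the residual connection) and the sum is layer-normalised, that is, LayerNorm(x + Sublayer(x)). A minimal sketch in plain Python, with assumptions stated up front: `toy_sublayer` is a hypothetical stand-in for attention or the feed-forward network, and the learned gain and bias of layer norm are omitted.

```python
import math

def layer_norm(x, eps=1e-6):
    """Normalise one position's vector to zero mean and unit variance.

    The learned gain and bias are omitted in this sketch.
    """
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]

def add_and_norm(x, sublayer):
    """Residual connection followed by layer norm: LayerNorm(x + Sublayer(x))."""
    y = sublayer(x)
    return [
        layer_norm([a + b for a, b in zip(x_row, y_row)])
        for x_row, y_row in zip(x, y)
    ]

def toy_sublayer(x):
    # Hypothetical stand-in for attention or the feed-forward network.
    return [[2.0 * v for v in row] for row in x]

# Two positions, model width 4 (made-up numbers).
x = [[1.0, 2.0, 3.0, 4.0], [0.5, -0.5, 1.5, -1.5]]
out = add_and_norm(x, toy_sublayer)
```

Normalisation is applied to each position independently, so the output keeps the input's shape and one layer's output can feed directly into the next.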
