Paper 08 — Attention Is All You Need
Vaswani, Shazeer, Parmar, Uszkoreit, Jones, Gomez, Kaiser, Polosukhin · NeurIPS 2017 · arXiv:1706.03762
What this paper did
It replaced recurrence entirely.
Bahdanau’s attention (Paper 07) improved the decoder’s memory by letting it look back at all encoder states. But both encoder and decoder were still LSTMs — sequential by design, impossible to parallelise across the sequence.
The Transformer removed the LSTMs completely. Instead, every layer is built purely from attention operations and feed-forward networks, both of which process all positions in parallel. Training became far cheaper: the paper's base model trained in about 12 hours on eight P100 GPUs. The path between any two positions in the sequence shrank from O(T) sequential steps to O(1).
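To make the contrast concrete, here is a minimal sketch, assuming a toy tanh recurrence as a stand-in for an LSTM and random inputs. The sizes, weights, and variable names are illustrative and not taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
T, d = 6, 4                      # sequence length and feature size (illustrative)
X = rng.normal(size=(T, d))      # one row per position

# Recurrent style: position t cannot be processed until t-1 is finished.
W_x = rng.normal(size=(d, d)) * 0.1   # hypothetical input weights
W_h = rng.normal(size=(d, d)) * 0.1   # hypothetical recurrent weights
h = np.zeros(d)
sequential_steps = 0
for t in range(T):
    h = np.tanh(X[t] @ W_x + h @ W_h)  # depends on the previous h
    sequential_steps += 1

# Attention style: all T x T pairwise scores come from one matrix product.
scores = X @ X.T / np.sqrt(d)

print(sequential_steps)  # T dependent updates, one after another
print(scores.shape)      # (T, T): every pair of positions interacts directly
```

The loop cannot be parallelised across `t` because each update needs the previous `h`, while the score matrix has no such dependency.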
The core equation:
Attention(Q, K, V) = softmax( Q · Kᵀ / √dₖ ) · V
Every position computes a Query vector (what am I looking for?), a Key vector (what do I offer?), and a Value vector (what do I send when selected?). All pairwise Query-Key scores are computed at once via matrix multiplication, normalised via softmax, and used to blend Value vectors. Eight of these attention “heads” run in parallel per layer.
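The description above can be sketched in NumPy as follows. This is a minimal illustration, not the paper's reference implementation: the random projection matrices `W_q`, `W_k`, `W_v`, `W_o` stand in for learned weights, and the function names are my own. The sizes (model width 512, eight heads, so 64 dimensions per head) follow the paper's base configuration, but any width divisible by the head count would work.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row maximum for numerical stability.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    # Q, K, V: (..., T, d_k). Returns blended values and the attention weights.
    d_k = Q.shape[-1]
    scores = Q @ np.swapaxes(K, -1, -2) / np.sqrt(d_k)  # (..., T, T)
    weights = softmax(scores, axis=-1)                  # each row sums to 1
    return weights @ V, weights

def multi_head_self_attention(X, W_q, W_k, W_v, W_o, num_heads):
    T, d_model = X.shape
    d_k = d_model // num_heads

    def split(M):
        # (T, d_model) -> (num_heads, T, d_k)
        return M.reshape(T, num_heads, d_k).transpose(1, 0, 2)

    Q, K, V = split(X @ W_q), split(X @ W_k), split(X @ W_v)
    heads, weights = scaled_dot_product_attention(Q, K, V)
    # Concatenate the heads back to (T, d_model), then mix them with W_o.
    concat = heads.transpose(1, 0, 2).reshape(T, d_model)
    return concat @ W_o, weights

rng = np.random.default_rng(0)
T, d_model, num_heads = 4, 512, 8
X = rng.normal(size=(T, d_model))  # stand-in for 4 embedded tokens
W_q, W_k, W_v, W_o = (
    rng.normal(size=(d_model, d_model)) / np.sqrt(d_model) for _ in range(4)
)
out, weights = multi_head_self_attention(X, W_q, W_k, W_v, W_o, num_heads)
print(out.shape)      # (4, 512)
print(weights.shape)  # (8, 4, 4): one T x T weight matrix per head
```

Note that all eight heads are handled by a single batched matrix product, which is what makes the "in parallel" claim practical.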
Stack 6 encoder and 6 decoder layers of this, add positional encodings, layer normalisation, and residual connections, and you have the Transformer.
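Two of those extra ingredients are small enough to sketch directly. The positional encoding below uses the paper's sinusoidal formula (sine on even dimensions, cosine on odd ones). The layer normalisation is simplified: it omits the learned gain and bias, and the `eps` value is an assumption. Function names are illustrative.

```python
import numpy as np

def sinusoidal_positional_encoding(seq_len, d_model):
    # Assumes d_model is even.
    pos = np.arange(seq_len)[:, None]          # (seq_len, 1)
    two_i = np.arange(0, d_model, 2)[None, :]  # even dimension indices 2i
    angles = pos / np.power(10000.0, two_i / d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)  # PE(pos, 2i)
    pe[:, 1::2] = np.cos(angles)  # PE(pos, 2i+1)
    return pe

def layer_norm(x, eps=1e-6):
    # Normalise each position's feature vector to zero mean, unit variance.
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def add_and_norm(x, sublayer_out):
    # Residual connection followed by layer normalisation.
    return layer_norm(x + sublayer_out)

rng = np.random.default_rng(0)
seq_len, d_model = 4, 8
pe = sinusoidal_positional_encoding(seq_len, d_model)
x = rng.normal(size=(seq_len, d_model)) + pe   # stand-in embeddings plus positions
sublayer_out = rng.normal(size=x.shape)        # stand-in for attention or FFN output
y = add_and_norm(x, sublayer_out)
print(pe[0])    # position 0: sin(0) in even slots, cos(0) in odd slots
print(y.shape)  # same shape as the input
```

Because the residual path keeps the output the same shape as the input, layers built this way can be stacked without any reshaping between them.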
The Indian analogy
Instead of students answering one at a time (the RNN way), the whole classroom compares notes simultaneously. Every student sends a question (Query) to every other student, checks it against what each classmate says they can offer (Key), decides how much to trust each match (softmax weights), and blends what those classmates share (weighted Values). The classroom learns in parallel: one round of this gives everyone full context.
Read in this order
| Section | What you will learn | Difficulty | Time |
|---|---|---|---|
| 1. Context | The RNN wall in 2017 | 🟢 | 4 min |
| 2. The Problem | Sequential bottleneck and long-range limits | 🟢 | 3 min |
| 3. The Idea | Self-attention, Q/K/V, multi-head, positional encoding | 🟡 | 5 min |
| 4. The Math | Full attention formula with numerical worked example | 🔴 | 12 min |
| 5. Worked Example | One full encoder layer on “The chai is hot” | 🔴 | 12 min |
| 6. The Code | Scaled dot-product attention in NumPy | 🟡 | 6 min |
| 7. Limitations | Quadratic cost, positional encoding, compute requirements | 🟡 | 4 min |
| 8. Impact | BERT, GPT, AlphaFold, every AI system today | 🟢 | 4 min |
| 9. Summary | One-page recap | 🟢 | 3 min |
Also: Glossary · Quiz · Further Reading
Before you read: math tutorials you need
- Matrix Transpose: Q·Kᵀ requires understanding transpose ✅
- Matrix Multiplication: used throughout ✅
- Softmax Function: converts scores to attention weights ✅
- Normalisation: layer norm after every sub-layer ✅
- Dot Product: foundation of Q·K scoring ✅
The full architecture at a glance
INPUT TOKENS
↓
Embedding + Positional Encoding
↓
┌─────────────── Encoder × 6 ───────────────┐
│ Multi-Head Self-Attention │
│ Add & Norm │
│ Feed-Forward Network │
│ Add & Norm │
└────────────────────────────────────────────┘
↓ (encoder output to decoder cross-attention)
┌─────────────── Decoder × 6 ───────────────┐
│ Masked Multi-Head Self-Attention │
│ Add & Norm │
│ Multi-Head Cross-Attention (←encoder) │
│ Add & Norm │
│ Feed-Forward Network │
│ Add & Norm │
└────────────────────────────────────────────┘
↓
Linear + Softmax → Output Probabilities
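The one box in the diagram that differs in kind from the encoder is the decoder's masked self-attention: each position may attend only to itself and earlier positions. Here is a minimal single-head sketch, assuming random Q, K, V as stand-ins for projected decoder inputs. The helper names are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def causal_mask(T):
    # True where attention is allowed: position i may see positions j <= i.
    return np.tril(np.ones((T, T), dtype=bool))

def masked_self_attention(Q, K, V):
    T, d_k = Q.shape
    scores = Q @ K.T / np.sqrt(d_k)
    # Disallowed (future) positions get -inf, so softmax gives them weight 0.
    scores = np.where(causal_mask(T), scores, -np.inf)
    weights = softmax(scores, axis=-1)
    return weights @ V, weights

rng = np.random.default_rng(0)
T, d_k = 4, 8
Q, K, V = (rng.normal(size=(T, d_k)) for _ in range(3))
out, weights = masked_self_attention(Q, K, V)
print(np.round(weights, 2))  # lower-triangular: no weight on future positions
```

The first row of `weights` puts all its mass on position 0, since the first token has nothing earlier to look at.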