Paper 07 — Neural Machine Translation by Jointly Learning to Align and Translate
Bahdanau, Cho & Bengio · 2014 · arXiv:1409.0473
What this paper did
It broke the translation bottleneck.
The seq2seq model (Paper 06) forced all source information through a single fixed-size vector. For long sentences, this was like trying to describe an entire film in one sentence: details get lost. In the paper's experiments, the fixed-vector baseline's translation quality dropped steadily once sentences grew beyond roughly 20 words.
Bahdanau’s team replaced the fixed vector with a dynamic attention mechanism: at each decoding step, the decoder computes a fresh context vector by taking a weighted sum of all encoder hidden states, where the weights reflect how relevant each source word is right now. These weights are the attention weights.
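The result: the model never has to compress the whole source into one vector. At every decoding step it can look back at any part of the source, using weights that are non-negative and sum to 1. In the paper's experiments, translation quality held up on long sentences where the fixed-vector baseline degraded. And the attention weights, visualised as a heatmap, showed that the model had learned to align source and target words without being given any alignments, a job that earlier statistical translation systems handled with a separate, explicit alignment model.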
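To make the heatmap idea concrete, here is a minimal sketch of how a word alignment can be read off a matrix of attention weights. The sentence pair and the numbers in `A` are made up for illustration and do not come from the paper; the only assumption carried over is that each row (one target word) is a distribution over the source words.

```python
import numpy as np

source = ["the", "cat", "sleeps"]
target = ["le", "chat", "dort"]

# Hypothetical attention weights (illustrative values, not from the paper).
# Row t holds the weights used while producing target word t;
# column i is the weight placed on source word i. Each row sums to 1.
A = np.array([
    [0.85, 0.10, 0.05],
    [0.10, 0.80, 0.10],
    [0.05, 0.15, 0.80],
])

# A "hard" alignment: for each target word, take the source word
# that received the largest attention weight.
alignment = [(target[t], source[int(np.argmax(A[t]))]) for t in range(len(target))]
print(alignment)
# [('le', 'the'), ('chat', 'cat'), ('dort', 'sleeps')]
```

Taking the argmax throws information away. The weights themselves are a soft alignment, so one target word can draw on several source words at once, which is what the heatmap shows.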
The Indian analogy
A student answering a board exam essay question does not memorise the textbook and close it. She keeps it open, glancing back at the relevant paragraph for each sentence she writes. The encoder’s hidden states are the open textbook. The attention weights decide where her eyes point. The context vector is what she just read.
Read in this order
| Section | What you will learn | Difficulty | Time |
|---|---|---|---|
| 1. Context | Why translation needed this fix in 2014 | 🟢 | 4 min |
| 2. The Problem | The fixed context vector bottleneck | 🟢 | 3 min |
| 3. The Idea | Attention weights, soft alignment, bidirectional encoder | 🟢 | 4 min |
| 4. The Math | Alignment scores, softmax, context vector — worked by hand | 🔴 | 10 min |
| 5. Worked Example | Full decoding walkthrough with toy numbers | 🔴 | 10 min |
| 6. The Code | Attention in 25 lines of NumPy | 🟡 | 6 min |
| 7. Limitations | Quadratic complexity, sequential bottleneck, no self-attention | 🟡 | 4 min |
| 8. Impact | How this paper created the Transformer era | 🟢 | 4 min |
| 9. Summary | One-page recap | 🟢 | 3 min |
Also: Glossary · Quiz · Further Reading
Before you read: math tutorials you need
- Softmax Function: how attention scores become weights
- Probability Distributions: why attention weights are a distribution
- Dot Product: used in alignment score computation
- Matrix Multiplication: used in encoder weight matrices
The key equations
eₜᵢ = vₐᵀ · tanh(Wₐ · sₜ₋₁ + Uₐ · hᵢ) ← alignment score (how relevant is source word i at step t?)
αₜᵢ = exp(eₜᵢ) / Σⱼ exp(eₜⱼ) ← attention weight (probability, sums to 1)
cₜ = Σᵢ αₜᵢ · hᵢ ← context vector (fresh at every step)
sₜ = f(sₜ₋₁, yₜ₋₁, cₜ) ← decoder update
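The first three equations can be sketched in NumPy as a single decoding step. This is a minimal illustration under stated assumptions, not the paper's implementation: the function name `attention_step`, the toy dimensions, and the random parameters are all hypothetical, and the decoder update `f` is left out because in the paper it is a gated recurrent unit with its own parameters.

```python
import numpy as np

def softmax(x):
    # Subtracting the max avoids overflow and does not change the result.
    z = np.exp(x - np.max(x))
    return z / z.sum()

def attention_step(s_prev, H, W_a, U_a, v_a):
    """One step of additive attention (hypothetical helper for illustration).

    s_prev : previous decoder state s_{t-1}, shape (n,)
    H      : encoder hidden states, shape (T, m), row i is h_i
    W_a    : shape (d, n);  U_a : shape (d, m);  v_a : shape (d,)
    """
    # Alignment scores: e[i] = v_a^T tanh(W_a s_prev + U_a h_i), for all i at once.
    e = np.tanh(s_prev @ W_a.T + H @ U_a.T) @ v_a   # shape (T,)
    # Attention weights: softmax over source positions.
    alpha = softmax(e)                               # shape (T,)
    # Context vector: weighted sum of encoder states.
    c = alpha @ H                                    # shape (m,)
    return e, alpha, c

# Toy sizes and random parameters, chosen only to make the sketch runnable.
rng = np.random.default_rng(0)
T, m, n, d = 5, 4, 3, 6
H = rng.normal(size=(T, m))
s_prev = rng.normal(size=n)
W_a = rng.normal(size=(d, n))
U_a = rng.normal(size=(d, m))
v_a = rng.normal(size=d)

e, alpha, c = attention_step(s_prev, H, W_a, U_a, v_a)
print("weights:", alpha, "sum:", alpha.sum())
print("context:", c)
```

Because `U_a @ h_i` does not depend on the decoder state, the term `H @ U_a.T` could be computed once per source sentence and reused at every decoding step.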