Paper 07 — Neural Machine Translation by Jointly Learning to Align and Translate

Bahdanau, Cho & Bengio · 2014 · arXiv:1409.0473

What this paper did

It broke the translation bottleneck.

The seq2seq model (Paper 06) forced all source information through a single fixed-size vector. For long sentences, this was like trying to describe an entire film in one sentence: details get lost. Translation quality fell off sharply once sentences grew beyond roughly 20 words.

Bahdanau’s team replaced the fixed vector with a dynamic attention mechanism: at each decoding step, the decoder computes a fresh context vector by taking a weighted sum of all encoder hidden states, where the weights reflect how relevant each source word is right now. These weights are the attention weights.
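As a rough illustration of the weighted sum described above, here is a minimal NumPy sketch. The hidden states in `H` and the weights in `alpha` are made-up toy numbers, not values from the paper, and the weights are simply given here rather than computed.

```python
import numpy as np

# Hypothetical encoder hidden states for a 3-word source sentence
# (one row per source word, 2 dimensions each).
H = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [1.0, 1.0]])

# Hypothetical attention weights for a single decoding step:
# non-negative and summing to 1.
alpha = np.array([0.5, 0.25, 0.25])

# Context vector = weighted sum of the encoder hidden states.
context = alpha @ H   # equals [0.75, 0.5]
```

At the next decoding step the weights would change, so the context vector changes with them, while `H` stays fixed.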

The result: the model never has to fully compress the source. At every decoding step it can look back at any part of the sentence, spreading its attention across the source words in whatever proportions it has learned (the weights are non-negative and sum to 1). Translation quality held up much better on long sentences. And the attention weights, visualised as a heatmap, showed that the model had learned to align source and target words on its own, with no alignment labels in the training data.
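To make the heatmap idea concrete, the sketch below reads a word alignment off an attention matrix by taking the most-attended source word for each target word. The sentence pair and the matrix `A` are invented for illustration. In a trained model each row would come from the softmax over alignment scores.

```python
import numpy as np

source = ["the", "black", "cat"]
target = ["le", "chat", "noir"]

# Hypothetical attention weights: one row per target word,
# one column per source word. Each row sums to 1.
A = np.array([[0.8, 0.1, 0.1],
              [0.1, 0.2, 0.7],
              [0.1, 0.7, 0.2]])

# For each target word, pick the source word with the largest weight.
alignment = [(t, source[j]) for t, j in zip(target, A.argmax(axis=1))]
# alignment == [("le", "the"), ("chat", "cat"), ("noir", "black")]
```

Note that the alignment is not monotonic: "chat" attends to the third source word and "noir" to the second, which is the kind of reordering a heatmap makes visible.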

The Indian analogy

A student answering a board exam essay question does not memorise the textbook and close it. She keeps it open, glancing back at the relevant paragraph for each sentence she writes. The encoder’s hidden states are the open textbook. The attention weights decide where her eyes point. The context vector is what she just read.

Read in this order

Section	What you will learn	Difficulty	Time
1. Context	Why translation needed this fix in 2014	🟢	4 min
2. The Problem	The fixed context vector bottleneck	🟢	3 min
3. The Idea	Attention weights, soft alignment, bidirectional encoder	🟢	4 min
4. The Math	Alignment scores, softmax, context vector — worked by hand	🔴	10 min
5. Worked Example	Full decoding walkthrough with toy numbers	🔴	10 min
6. The Code	Attention in 25 lines of NumPy	🟡	6 min
7. Limitations	Quadratic complexity, sequential bottleneck, no self-attention	🟡	4 min
8. Impact	How this paper created the Transformer era	🟢	4 min
9. Summary	One-page recap	🟢	3 min

Also: Glossary · Quiz · Further Reading

Before you read: math tutorials you need

Softmax Function → — how attention scores become weights ✅
Probability Distributions → — why attention weights are a distribution ✅
Dot Product → — used in alignment score computation ✅
Matrix Multiplication → — used in encoder weight matrices ✅

The key equations

eₜᵢ  = vₐᵀ · tanh(Wₐ · sₜ₋₁ + Uₐ · hᵢ)     ← alignment score (how relevant is source word i at step t?)
αₜᵢ  = exp(eₜᵢ) / Σⱼ exp(eₜⱼ)               ← attention weight (probability, sums to 1)
cₜ   = Σᵢ αₜᵢ · hᵢ                           ← context vector (fresh at every step)
sₜ   = f(sₜ₋₁, yₜ₋₁, cₜ)                    ← decoder update
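
The first three equations can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions, not the paper's implementation: the function name `additive_attention`, the dimensions, and the random parameters are all hypothetical, and the decoder update f is left out.

```python
import numpy as np

def additive_attention(s_prev, H, W_a, U_a, v_a):
    """One attention step. Returns (alpha, context).

    s_prev : previous decoder state, shape (dec_dim,)
    H      : encoder hidden states, shape (T, enc_dim)
    W_a    : shape (att_dim, dec_dim)
    U_a    : shape (att_dim, enc_dim)
    v_a    : shape (att_dim,)
    """
    # Alignment scores e[i] = v_a^T tanh(W_a s_prev + U_a h_i), for all i at once.
    scores = np.tanh(s_prev @ W_a.T + H @ U_a.T) @ v_a   # shape (T,)

    # Softmax over source positions (subtracting the max for numerical stability).
    exp_scores = np.exp(scores - scores.max())
    alpha = exp_scores / exp_scores.sum()

    # Context vector: attention-weighted sum of encoder states.
    context = alpha @ H                                   # shape (enc_dim,)
    return alpha, context

# Toy usage with random, untrained parameters (assumed sizes).
rng = np.random.default_rng(0)
T, enc_dim, dec_dim, att_dim = 5, 4, 3, 6
H = rng.normal(size=(T, enc_dim))
s_prev = rng.normal(size=dec_dim)
W_a = rng.normal(size=(att_dim, dec_dim))
U_a = rng.normal(size=(att_dim, enc_dim))
v_a = rng.normal(size=att_dim)

alpha, context = additive_attention(s_prev, H, W_a, U_a, v_a)
```

Because the term `W_a s_prev` does not depend on i, it is computed once and broadcast across all T source positions. Only `U_a h_i` varies per source word.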

← Paper 06 — Seq2Seq → Paper 08 — Transformer
