Further Reading — Paper 07: Attention / Bahdanau (2014)

Neural Machine Translation by Jointly Learning to Align and Translate · 2014 · Dzmitry Bahdanau et al.

The original paper

Bahdanau, D., Cho, K., & Bengio, Y. (2014). Neural Machine Translation by Jointly Learning to Align and Translate. arXiv:1409.0473. https://arxiv.org/abs/1409.0473

Start with the abstract and Section 3 (the model). Figure 3 (the alignment heatmaps) is the most discussed figure in the paper and requires no maths to appreciate. Figure 2, discussed with the results in Section 5, plots BLEU against sentence length and demonstrates the long-sentence improvement.

Essential context — read with this paper

Sutskever, I., Vinyals, O., & Le, Q. V. (2014). Sequence to Sequence Learning with Neural Networks. https://arxiv.org/abs/1409.3215

Paper 06 in this curriculum — the model that Bahdanau attention improves. The bottleneck problem is clearest after you have read both papers in sequence.

The direct follow-up: Luong attention

Luong, M.-T., Pham, H., & Manning, C. D. (2015). Effective Approaches to Attention-based Neural Machine Translation. arXiv:1508.04025. https://arxiv.org/abs/1508.04025

Proposes dot-product attention and “general” attention as faster alternatives to Bahdanau’s additive formulation. Table 2 shows they achieve similar BLEU with less computation. Short and readable. Reading this before Paper 08 bridges the gap well.
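To make the comparison concrete, here is a minimal sketch of the three scoring functions in plain Python. The function and parameter names (`score_dot`, `score_general`, `score_additive`, `W`, `W_s`, `W_h`, `v`) are illustrative choices, not names from either paper, and the weights are hand-picked rather than learned. It also glosses over one detail: Bahdanau scores against the previous decoder state, while Luong uses the current one.

```python
import math

def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def matvec(M, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [dot(row, v) for row in M]

def score_dot(s, h):
    # Luong "dot": s^T h, no parameters at all
    return dot(s, h)

def score_general(s, h, W):
    # Luong "general": s^T W h, one learned matrix
    return dot(s, matvec(W, h))

def score_additive(s, h, W_s, W_h, v):
    # Bahdanau additive: v^T tanh(W_s s + W_h h),
    # i.e. a one-hidden-layer network evaluated per (s, h) pair
    hidden = [math.tanh(a + b) for a, b in zip(matvec(W_s, s), matvec(W_h, h))]
    return dot(v, hidden)

# Toy inputs (assumed values, for illustration only)
s = [1.0, 0.0]            # decoder state
h = [0.5, 2.0]            # one encoder annotation
I = [[1.0, 0.0], [0.0, 1.0]]

print(score_dot(s, h))
print(score_general(s, h, I))
print(score_additive(s, h, I, I, [1.0, 1.0]))
```

With an identity matrix for `W`, the "general" score reduces to the dot score, which shows that the dot product is the parameter-free special case. The additive score needs a `tanh` and two matrix products per pair, which is where the extra cost comes from.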

What came next: the Transformer

Vaswani, A. et al. (2017). Attention Is All You Need. https://arxiv.org/abs/1706.03762

Paper 08 in this curriculum. Cites Bahdanau et al. as direct prior work in its introduction. The Query-Key-Value formulation here is the same weighted-sum idea, extended to self-attention and stripped of recurrence.
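The shared weighted-sum idea can be sketched in a few lines of plain Python. This is an illustration under stated assumptions, not code from either paper: `attend` and `softmax` are hypothetical helper names, the score is the scaled dot product used in the Transformer, and the learned projections that produce queries, keys, and values are left out.

```python
import math

def softmax(xs):
    """Turn raw scores into positive weights that sum to 1."""
    m = max(xs)                       # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attend(query, keys, values):
    """Score the query against every key, then average the values by weight."""
    d = len(query)
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d)
              for key in keys]
    weights = softmax(scores)
    dim = len(values[0])
    context = [sum(w * v[i] for w, v in zip(weights, values))
               for i in range(dim)]
    return weights, context

# Toy example (assumed values): two source positions with identical keys,
# so the weights are uniform and the context is the mean of the values.
weights, context = attend(
    query=[1.0, 1.0],
    keys=[[1.0, 0.0], [1.0, 0.0]],
    values=[[0.0, 2.0], [4.0, 0.0]],
)
print(weights, context)
```

In Bahdanau's model the query is the decoder state and the encoder annotations serve as both keys and values. In self-attention the queries, keys, and values all come from the same sequence.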

Accessible explanations

Jay Alammar’s blog: “Visualizing Neural Machine Translation Mechanics of Seq2seq Models with Attention” https://jalammar.github.io/visualizing-neural-machine-translation-mechanics-of-seq2seq-models-with-attention/

Animated GIFs showing the encoder states, the attention weights updating step by step, and the context vector being assembled. If the math in Section 4 felt dense, this visual walkthrough will make it click. Read this first if you are a visual learner.

Lilian Weng’s blog: “Attention? Attention!” https://lilianweng.github.io/posts/2018-06-24-attention/

A comprehensive survey of attention mechanisms from Bahdanau through the Transformer. Well organised, with equations and context. An excellent reference for seeing how the idea evolved.

Video lectures

Stanford CS224N: Lecture on NMT and Attention. Available on YouTube: search “CS224N 2021 Lecture 8 NMT”

Professor Christopher Manning’s group, which published Luong attention, covers the Bahdanau paper with slides and intuition. About 75 minutes.

Andrej Karpathy: “The Unreasonable Effectiveness of Recurrent Neural Networks” http://karpathy.github.io/2015/05/21/rnn-effectiveness/

Not about attention specifically, but it explains the RNN era that Bahdanau attention improved upon. Essential background reading.

Next in this curriculum

Paper 08 — Attention Is All You Need (Transformer, 2017) →

The paper that took Bahdanau’s attention equations, removed all the recurrence, scaled everything up, and created the architecture behind most of today’s large language models.

Math tutorials you will need for Paper 08:

Matrix Transpose → ❌ not yet built
Normalisation → ❌ not yet built
Softmax Function → ✅
