1. Context — the translation bottleneck

By late 2014, the AI community was genuinely excited about a new approach to machine translation. That same year, Ilya Sutskever’s team at Google had published the seq2seq paper (Paper 06), and researchers worldwide were rushing to improve on it. Neural networks were, for the first time, learning to translate sentences end-to-end without hand-crafted rules or linguistic knowledge. Feed English text in; get French text out. It felt like magic.

But practitioners kept running into the same wall: the longer the sentence, the worse the translation.

Short sentences translated beautifully. “How are you?” → “Comment allez-vous?” Flawless. But feed in a sentence of twenty words or more, such as a long sentence from a news article or a complex technical instruction, and the output would degrade into something half-translated and half-garbled. The network was losing information about the earlier parts of the sentence.

The reason was structural. In the seq2seq architecture, the entire meaning of the input sentence had to be compressed into a single, fixed-size vector — the “context vector” — before the decoder could begin translating. It did not matter whether the input was five words or fifty. The same small vector had to carry all of it. Researchers called this the bottleneck problem.
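The bottleneck can be made concrete with a toy encoder. The sketch below is illustrative only: it uses a plain tanh recurrent network with random weights rather than the LSTM layers of the actual seq2seq model, and the names `make_encoder`, `HIDDEN_SIZE`, and `EMBED_SIZE` are hypothetical. The point it shows is that the context vector has the same size whether the input has five tokens or fifty.

```python
import math
import random

HIDDEN_SIZE = 4  # hypothetical size, kept tiny for readability
EMBED_SIZE = 3   # hypothetical size of each word embedding


def make_encoder(hidden_size, embed_size, seed=0):
    """Build a toy tanh RNN encoder with random, untrained weights."""
    rng = random.Random(seed)
    w_in = [[rng.uniform(-0.5, 0.5) for _ in range(embed_size)]
            for _ in range(hidden_size)]
    w_rec = [[rng.uniform(-0.5, 0.5) for _ in range(hidden_size)]
             for _ in range(hidden_size)]

    def encode(embeddings):
        # The hidden state is overwritten at every step; only the final
        # state is returned, and it serves as the context vector.
        h = [0.0] * hidden_size
        for x in embeddings:
            h = [
                math.tanh(
                    sum(w_in[i][k] * x[k] for k in range(embed_size))
                    + sum(w_rec[i][k] * h[k] for k in range(hidden_size))
                )
                for i in range(hidden_size)
            ]
        return h

    return encode


def random_sentence(length, rng):
    """Stand-in for a sentence: one random embedding per token."""
    return [[rng.uniform(-1, 1) for _ in range(EMBED_SIZE)]
            for _ in range(length)]


rng = random.Random(1)
encode = make_encoder(HIDDEN_SIZE, EMBED_SIZE)
context_short = encode(random_sentence(5, rng))
context_long = encode(random_sentence(50, rng))

# Both sentences end up in a vector of exactly HIDDEN_SIZE numbers.
print(len(context_short), len(context_long))
```

However long the input, the decoder in this setup only ever sees `HIDDEN_SIZE` numbers, which is the structural limit the paragraph above describes.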

Dzmitry Bahdanau, then an intern in Yoshua Bengio’s lab at the Université de Montréal working with Kyunghyun Cho and Bengio, had a different intuition. He had observed how human translators work. A good human translator does not read a sentence once, commit it entirely to memory, then produce the translation from scratch. Instead, the translator’s eyes move back and forth. When writing a particular word in the target language, they glance back at the specific part of the source sentence that is relevant right now. The act of translating is inherently an act of selectively looking.

Could a neural network learn to do the same thing?

In September 2014, Bahdanau, Cho, and Bengio published “Neural Machine Translation by Jointly Learning to Align and Translate.” The paper proposed a mechanism that allowed the decoder to look back at every word of the source sentence at every decoding step, weighting each source word by how relevant it was to generating the current target word.

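The paper described these relevance weights as a soft alignment between source and target words, and the overall idea as a mechanism of attention in the decoder. Today they are known as attention weights, and the mechanism as the attention mechanism.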
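A single decoding step of this weighting can be sketched in a few lines. The code below follows the additive scoring form used in the paper, where the score for source position j is v_a · tanh(W_a s + U_a h_j), the weights are a softmax over those scores, and the context is the weighted sum of the encoder states. Everything else is an assumption for illustration: the parameters are random rather than trained, the dimensions are arbitrary, and the function and variable names (`additive_attention`, `rand_matrix`, and so on) are hypothetical.

```python
import math
import random


def softmax(scores):
    """Turn arbitrary scores into positive weights that sum to 1."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def additive_attention(decoder_state, annotations, W_a, U_a, v_a):
    """One decoding step: score every source position, then blend them.

    decoder_state: previous decoder hidden state s
    annotations:   one encoder hidden state h_j per source word
    """
    projected_state = matvec(W_a, decoder_state)
    scores = []
    for h_j in annotations:
        projected_h = matvec(U_a, h_j)
        hidden = [math.tanh(a + b)
                  for a, b in zip(projected_state, projected_h)]
        scores.append(sum(v * t for v, t in zip(v_a, hidden)))
    weights = softmax(scores)  # one relevance weight per source word
    dim = len(annotations[0])
    context = [sum(w * h[k] for w, h in zip(weights, annotations))
               for k in range(dim)]
    return weights, context


rng = random.Random(0)


def rand_matrix(rows, cols):
    return [[rng.uniform(-1, 1) for _ in range(cols)] for _ in range(rows)]


# Arbitrary illustrative sizes: 6 source words, small hidden dimensions.
state_dim, annot_dim, attn_dim, src_len = 3, 4, 5, 6
W_a = rand_matrix(attn_dim, state_dim)
U_a = rand_matrix(attn_dim, annot_dim)
v_a = [rng.uniform(-1, 1) for _ in range(attn_dim)]
annotations = rand_matrix(src_len, annot_dim)
decoder_state = [rng.uniform(-1, 1) for _ in range(state_dim)]

weights, context = additive_attention(decoder_state, annotations,
                                      W_a, U_a, v_a)
print([round(w, 3) for w in weights])
```

Because a fresh context vector is computed at every decoding step, the decoder is no longer limited to one fixed summary of the whole sentence. Stacking the `weights` list from each step as rows would give a matrix of source positions against target positions.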

The paper’s BLEU scores (the standard translation quality metric) on English-to-French translation improved markedly for long sentences, exactly where seq2seq had been struggling. But more importantly, the paper produced something unexpected: you could now visualise what the model was “looking at.” The alignment matrix, a heatmap of attention weights, showed that the model had learned, without being told, that “zone” in French corresponds to “Area” in English, and that “European Economic Area” maps to “zone économique européenne” even though the word order is reversed between the two languages. The model had discovered linguistic structure by itself.

That visualisation caught the imagination of the entire research community. Attention was not just a trick that improved BLEU scores. It was a new way of thinking about how neural networks could process information. Everything downstream — the Transformer, GPT, BERT, the models you use today — is built on the foundation of this paper.