9. What came next — the road to attention

The seq2seq model showed that a single deep neural network, trained end to end, could learn complex language mappings such as translation. It displaced much of the older statistical pipeline and helped pull NLP under the banner of deep learning.

But as we saw, the single-vector bottleneck was suffocating the model’s ability to handle long text.

The fix arrived almost immediately. In the very same year (2014), Dzmitry Bahdanau and his colleagues published a paper proposing a radical patch to the seq2seq architecture. They argued: why force the encoder to compress everything into one vector? Why not give the decoder access to all the hidden states of the encoder, and let the decoder dynamically focus on different parts of the input sentence at each step?

They called this mechanism attention.
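
To make the idea concrete, here is a minimal sketch of one attention step in plain Python. It is an illustration under stated assumptions, not the paper's implementation: the function names (`softmax`, `attend`) and the toy vectors are made up for this example, and the scoring uses a simple dot product, whereas Bahdanau et al. score each encoder state with a small learned feed-forward network.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(decoder_state, encoder_states):
    """Return (weights, context) for a single decoder step."""
    # Assumption: dot-product scoring for brevity. The original paper
    # uses a learned feed-forward alignment model instead.
    scores = [
        sum(d * h for d, h in zip(decoder_state, enc))
        for enc in encoder_states
    ]
    # Turn the scores into weights that are positive and sum to 1.
    weights = softmax(scores)
    # The context vector is the weighted average of ALL encoder states,
    # so nothing has to be squeezed into one fixed vector up front.
    dim = len(encoder_states[0])
    context = [
        sum(w * enc[i] for w, enc in zip(weights, encoder_states))
        for i in range(dim)
    ]
    return weights, context

# Toy data (hypothetical): three input positions, hidden size 2.
encoder_states = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
decoder_state = [0.0, 2.0]

weights, context = attend(decoder_state, encoder_states)
print("attention weights:", weights)
print("context vector:", context)
```

The decoder would recompute `weights` at every output step, which is what lets it look at different parts of the input as the translation proceeds.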

In Paper 07: Neural Machine Translation by Jointly Learning to Align and Translate, we’ll see how adding attention to the seq2seq model largely removes the bottleneck, because the decoder no longer depends on a single fixed-size vector.

And by 2017, researchers would realise that attention was powerful enough that they didn’t need the LSTMs at all, leading to the legendary Paper 08: Attention Is All You Need (the Transformer).

Where seq2seq lives today

The pure encoder-decoder-with-LSTMs design is mostly historical now. But its DNA is everywhere:

Every encoder-decoder Transformer (T5, BART, mT5, IndicBART) inherits the split-the-model-in-two structure from this paper.
Teacher forcing is still the dominant training technique for autoregressive language models.
Greedy decoding and beam search are still standard options at inference time, and they work the same way.
BLEU is still one of the most widely reported translation metrics.
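
As a quick reminder of how greedy decoding and beam search differ, here is a small sketch. Everything in it is hypothetical: the `NEXT` table stands in for a trained model and only conditions on the previous token, and the beam search has no length normalisation, which real systems often add.

```python
import math

# Hypothetical toy "model": next-token probabilities given the previous token.
NEXT = {
    "<s>": {"a": 0.6, "b": 0.4},
    "a": {"x": 0.55, "y": 0.45},
    "b": {"x": 0.9, "y": 0.1},
    "x": {"</s>": 1.0},
    "y": {"</s>": 1.0},
}

def greedy(max_len=5):
    """Pick the single most probable next token at every step."""
    seq, logp = ["<s>"], 0.0
    while seq[-1] != "</s>" and len(seq) < max_len:
        tok, p = max(NEXT[seq[-1]].items(), key=lambda kv: kv[1])
        seq.append(tok)
        logp += math.log(p)
    return seq, logp

def beam_search(beam_width=2, max_len=5):
    """Keep the beam_width best partial sequences at every step."""
    beams = [(["<s>"], 0.0)]
    for _ in range(max_len - 1):
        candidates = []
        for seq, logp in beams:
            if seq[-1] == "</s>":
                # Finished hypotheses carry over unchanged.
                candidates.append((seq, logp))
                continue
            for tok, p in NEXT[seq[-1]].items():
                candidates.append((seq + [tok], logp + math.log(p)))
        # Rank by total log-probability and keep the best few.
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:beam_width]
        if all(seq[-1] == "</s>" for seq, _ in beams):
            break
    return beams[0]

greedy_seq, greedy_logp = greedy()
beam_seq, beam_logp = beam_search()
print("greedy:", greedy_seq, round(math.exp(greedy_logp), 2))
print("beam:  ", beam_seq, round(math.exp(beam_logp), 2))
```

In this toy table, greedy decoding commits to "a" because it looks best at the first step, while the beam keeps "b" alive and ends up with a more probable full sequence.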

Not bad for an architecture that was supposed to be “too simple to work”.
