1. Context — the end of hand-coded rules

To understand why this paper hit the AI world like a thunderbolt in 2014, you need to know how we were translating languages before it arrived.

For decades, machine translation was dominated by Statistical Machine Translation (SMT). If you wanted to build an English-to-Hindi translator, you couldn’t just throw data at a single model. You needed teams of linguists and engineers to assemble and tune a pipeline of separate components. There were phrase tables, extracted from word-aligned parallel text, mapping “how are you” to “aap kaise hain”. There were reordering models, because English puts the verb in the middle of a sentence (Subject-Verb-Object), while Hindi puts it at the end (Subject-Object-Verb). It was an endless, fragile game of patching together hand-engineered features and probability tables.

By 2014, neural networks were showing immense promise. As we saw in Paper 04 (LSTM), we had networks that could handle sequential data. Thanks to Paper 05 (Word2Vec), we knew how to represent words as rich, dense mathematical vectors.

But there was no established, elegant way to map a sequence of varying length (like an English sentence) to another sequence of a different length (like a French sentence) using only neural networks. Most researchers expected neural networks to play a supporting role at best, for example by rescoring the candidate translations that a statistical system had already produced.

Then, Ilya Sutskever, Oriol Vinyals, and Quoc V. Le at Google published this paper. They ignored the old statistical alignment tables entirely. They proposed a pure, end-to-end neural network. You feed English text into one end, and French text comes out the other. The network figures out the grammar, the word order, and the meaning all by itself, purely by looking at millions of translated sentence pairs.
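To make the "text in one end, text out the other" idea concrete, here is a minimal sketch of the encoder-decoder shape, under loud assumptions: the weights are random and untrained, a plain tanh RNN stands in for the LSTMs the paper actually used, and the toy vocabularies and every function name (encode, decode, rnn_step) are hypothetical, chosen only for illustration. It shows the mechanics only: a sentence of any length is squeezed into one fixed-size vector, and an output of a different length is generated from that vector until an end-of-sentence token appears.

```python
import math
import random

random.seed(0)

HIDDEN = 8  # size of the fixed-length summary vector (illustrative)
SRC_VOCAB = ["how", "are", "you", "<eos>"]
TGT_VOCAB = ["comment", "allez", "vous", "<eos>"]


def rand_matrix(rows, cols):
    """Random weights stand in for parameters that training would learn."""
    return [[random.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]


def matvec(m, v):
    return [sum(w * x for w, x in zip(row, v)) for row in m]


def one_hot(index, size):
    return [1.0 if i == index else 0.0 for i in range(size)]


def rnn_step(w_in, w_rec, x, h):
    """One plain-RNN update: h' = tanh(W_in x + W_rec h). Not an LSTM."""
    return [math.tanh(a + b) for a, b in zip(matvec(w_in, x), matvec(w_rec, h))]


# Untrained, randomly initialised parameters (hypothetical names).
enc_in, enc_rec = rand_matrix(HIDDEN, len(SRC_VOCAB)), rand_matrix(HIDDEN, HIDDEN)
dec_in, dec_rec = rand_matrix(HIDDEN, len(TGT_VOCAB)), rand_matrix(HIDDEN, HIDDEN)
dec_out = rand_matrix(len(TGT_VOCAB), HIDDEN)


def encode(tokens):
    """Read a source sentence of any length into one fixed-size vector."""
    h = [0.0] * HIDDEN
    for tok in tokens:
        x = one_hot(SRC_VOCAB.index(tok), len(SRC_VOCAB))
        h = rnn_step(enc_in, enc_rec, x, h)
    return h


def decode(h, max_len=10):
    """Emit target tokens greedily until <eos> or max_len is reached."""
    eos = TGT_VOCAB.index("<eos>")
    prev, output = eos, []  # <eos> doubles as the start symbol in this sketch
    for _ in range(max_len):
        h = rnn_step(dec_in, dec_rec, one_hot(prev, len(TGT_VOCAB)), h)
        scores = matvec(dec_out, h)
        prev = scores.index(max(scores))  # greedy pick of the best-scoring token
        if prev == eos:
            break
        output.append(TGT_VOCAB[prev])
    return output


thought = encode(["how", "are", "you", "<eos>"])
translation = decode(thought)
print(len(thought), translation)
```

Because nothing here has been trained, the printed "translation" is meaningless. The point is that no phrase table or reordering model appears anywhere: in a real system, all of that behaviour would have to be learned into the weight matrices from sentence pairs.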

It was a bold, brute-force approach that many thought was too simple to work. But it did work, and it changed natural language processing forever.