2. The problem — variable lengths and mixed-up words

Before 2014, standard feedforward neural networks had a severe limitation: they expected a fixed-size input and produced a fixed-size output. If you trained a network to recognise 28×28 pixel images, you couldn’t suddenly feed it a 50×50 image without resizing the image or rebuilding the network.
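To make the fixed-size constraint concrete, here is a minimal sketch of a single fully connected layer in plain Python. The `DenseLayer` class and its random weights are illustrative assumptions, not code from any particular library. The point is only that the weight matrix is shaped for one input size, so any other size is rejected.

```python
import random


class DenseLayer:
    """A toy fully connected layer whose weights fix the input size."""

    def __init__(self, n_in, n_out, seed=0):
        rng = random.Random(seed)
        self.n_in = n_in
        # One row of n_in weights per output unit.
        self.weights = [
            [rng.uniform(-0.1, 0.1) for _ in range(n_in)] for _ in range(n_out)
        ]

    def forward(self, x):
        # The matrix-vector product is only defined for inputs of length n_in.
        if len(x) != self.n_in:
            raise ValueError(f"expected {self.n_in} inputs, got {len(x)}")
        return [sum(w * v for w, v in zip(row, x)) for row in self.weights]


layer = DenseLayer(n_in=28 * 28, n_out=10)
small_image = [0.5] * (28 * 28)  # flattened 28x28 image
large_image = [0.5] * (50 * 50)  # flattened 50x50 image

scores = layer.forward(small_image)  # works: one score per output unit

try:
    layer.forward(large_image)
    accepted_large = True
except ValueError:
    accepted_large = False  # the layer cannot take the larger image
```

The same reasoning applies on the output side: the number of rows in the weight matrix fixes how many values come out.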

Language doesn’t play by fixed rules. Sentences stretch and shrink.

Think about translating everyday sentences:

“I am drinking chai.” (4 words)
“Main chai pi raha hoon.” (5 words)

Not only are the lengths different, but the mapping is rarely one-to-one. In English, the word order is “drinking” (verb) then “chai” (object). In Hindi, it’s “chai” (object) then “pi raha hoon” (verb phrase).
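A naive word-for-word lookup shows the mismatch in a few lines. This is only a sketch: the `LEXICON` below is a hypothetical toy dictionary with simplified glosses, chosen to cover this one sentence.

```python
# Hypothetical toy lexicon with simplified glosses, for illustration only.
LEXICON = {
    "i": "main",
    "am": "hoon",
    "drinking": "pi raha",
    "chai": "chai",
}


def translate_word_by_word(sentence):
    """Replace each English word with its gloss, keeping English word order."""
    words = sentence.lower().rstrip(".").split()
    return " ".join(LEXICON[w] for w in words)


source = "I am drinking chai."
naive = translate_word_by_word(source)
reference = "main chai pi raha hoon"

print(naive)      # main hoon pi raha chai
print(reference)  # main chai pi raha hoon
```

The lookup finds the right words but leaves them in English order, so the result does not match the reference sentence.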

If you try to translate word-by-word, you get gibberish. You need a system that can absorb the entire input sentence, hold the underlying meaning in its memory, and then generate the output sentence from scratch — taking as many steps as it needs to finish the thought.
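The shape of such a system can be sketched with hand-written rules standing in for a trained network. Everything here is an assumption made for illustration: the `encode` and `decode_steps` functions, the three meaning slots, and the `<eos>` stop marker are hypothetical, and the rules only cover sentences of the form "I am drinking X". What the sketch does show is the control flow: read the whole input first, keep a fixed-size summary, then emit output tokens one step at a time until a stop signal.

```python
EOS = "<eos>"  # hypothetical end-of-sentence marker


def encode(tokens):
    """Read the whole input and fill a fixed set of meaning slots."""
    meaning = {"subject": None, "action": None, "object": None}
    for tok in tokens:
        if tok == "i":
            meaning["subject"] = "first_person"
        elif tok == "drinking":
            meaning["action"] = "drink_progressive"
        elif tok != "am":
            meaning["object"] = tok
    return meaning


def decode_steps(meaning):
    """Yield Hindi tokens one at a time (object before verb), then EOS."""
    if meaning["subject"] == "first_person":
        yield "main"
    if meaning["object"] is not None:
        yield meaning["object"]
    if meaning["action"] == "drink_progressive":
        yield from ("pi", "raha", "hoon")
    yield EOS


def translate(sentence):
    tokens = sentence.lower().rstrip(".").split()
    meaning = encode(tokens)  # the entire input is absorbed first
    output = []
    for token in decode_steps(meaning):  # output length is decided here
        if token == EOS:
            break
        output.append(token)
    return output


source = "I am drinking chai."
result = translate(source)
print(len(source.rstrip(".").split()), "->", len(result))  # 4 -> 5
print(" ".join(result))  # main chai pi raha hoon
```

Note that nothing ties the number of output tokens to the number of input tokens. The decoder stops when it emits the end marker, not when the input runs out.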

The core mathematical problem was this: how do you build a function that takes an input sequence

$$X = (x_1, x_2, \dots, x_T)$$

of length T, and maps it to an output sequence

$$Y = (y_1, y_2, \dots, y_{T'})$$

of a completely different length T'?
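
As a worked instance, the chai sentence from earlier fits this notation as follows, treating each word as one element of the sequence (a simplifying assumption; real systems may split text into different units):

$$X = (\text{I}, \text{am}, \text{drinking}, \text{chai}), \quad T = 4$$

$$Y = (\text{Main}, \text{chai}, \text{pi}, \text{raha}, \text{hoon}), \quad T' = 5$$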

At the time, no widely used neural architecture could do this cleanly. That’s the gap this paper filled.