2. The problem — variable lengths and mixed-up words
Before 2014, standard feedforward neural networks had a severe limitation: they expected a fixed-size input and produced a fixed-size output. If you trained a network to recognise 28×28 pixel images, you couldn’t suddenly feed it a 50×50 image.
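To make the fixed-size constraint concrete, here is a minimal sketch of a single dense layer in plain Python. The names (`dense_layer`, `WEIGHTS`) and the choice of 10 outputs are illustrative assumptions, not taken from any particular library. The weight matrix is sized for 784 inputs when it is built, so a 2,500-pixel image simply has nowhere to go.

```python
import random

random.seed(0)

N_INPUTS = 28 * 28   # the layer is built for flattened 28x28 images
N_OUTPUTS = 10       # assumed number of output units, chosen for illustration

# One weight per (input pixel, output unit) pair; the shape is fixed at construction.
WEIGHTS = [[random.gauss(0.0, 0.01) for _ in range(N_OUTPUTS)]
           for _ in range(N_INPUTS)]


def dense_layer(pixels):
    """Apply the fixed-size weight matrix to a flattened image."""
    if len(pixels) != N_INPUTS:
        # Matrix libraries raise a similar shape error at this point.
        raise ValueError(f"expected {N_INPUTS} inputs, got {len(pixels)}")
    return [sum(p * WEIGHTS[i][j] for i, p in enumerate(pixels))
            for j in range(N_OUTPUTS)]


small_image = [0.5] * (28 * 28)   # 784 pixels: fits the layer
large_image = [0.5] * (50 * 50)   # 2500 pixels: does not fit

scores = dense_layer(small_image)
print(len(scores))

try:
    dense_layer(large_image)
except ValueError as err:
    print("rejected:", err)
```

The same rigidity applies on the output side: this layer always returns exactly `N_OUTPUTS` numbers, no matter what it is shown.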
Language doesn’t play by fixed rules. Sentences stretch and shrink.
Think about translating everyday sentences:
- “I am drinking chai.” (4 words)
- “Main chai pi raha hoon.” (5 words)
Not only are the lengths different, but the mapping is rarely one-to-one. In English, the word order is “drinking” (verb) then “chai” (object). In Hindi, it’s “chai” (object) then “pi raha hoon” (verb phrase).
If you try to translate word-by-word, you get gibberish. You need a system that can absorb the entire input sentence, hold the underlying meaning in its memory, and then generate the output sentence from scratch — taking as many steps as it needs to finish the thought.
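A toy word-by-word translator shows how the naive approach goes wrong. The lexicon below is a hypothetical four-entry dictionary with rough glosses, made up purely for this example; real translation systems do not work this way.

```python
# Hypothetical toy lexicon with rough English-to-Hindi glosses (illustration only).
LEXICON = {
    "i": "main",
    "am": "hoon",
    "drinking": "pi raha",
    "chai": "chai",
}


def word_by_word(sentence):
    """Translate each word independently, keeping the English word order."""
    words = sentence.lower().rstrip(".").split()
    return " ".join(LEXICON[w] for w in words)


naive = word_by_word("I am drinking chai.")
print(naive)  # main hoon pi raha chai
```

Every word has been translated, yet the result keeps English word order, so the verb phrase is split up and the object lands at the end instead of after the subject. The correct sentence, “Main chai pi raha hoon”, needs the whole input in view before the first output word is chosen.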
The core mathematical problem was this: how do you build a function that takes an input sequence
$$X = (x_1, x_2, \dots, x_T)$$
of length T, and maps it to an output sequence
$$Y = (y_1, y_2, \dots, y_{T'})$$
of a possibly different length T', one that isn't known in advance?
At the time, no standard neural architecture could do this cleanly. That’s the gap this paper filled.
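The shape of the function being asked for can be sketched as a generation loop. This is only an interface sketch under stated assumptions: `generate`, `toy_step`, and the `<eos>` marker are hypothetical names, and `toy_step` replays a canned answer in place of a trained model. The point is that the output length T' is decided during generation, not fixed beforehand.

```python
from typing import Callable, List

EOS = "<eos>"  # hypothetical end-of-sequence marker


def generate(source: List[str],
             step: Callable[[List[str], List[str]], str],
             max_steps: int = 50) -> List[str]:
    """Emit one token at a time until `step` signals the end of the sequence."""
    output: List[str] = []
    for _ in range(max_steps):
        token = step(source, output)
        if token == EOS:
            break
        output.append(token)
    return output


def toy_step(source: List[str], so_far: List[str]) -> str:
    """Stand-in for a trained model: replays a canned translation, then stops."""
    canned = ["main", "chai", "pi", "raha", "hoon"]
    return canned[len(so_far)] if len(so_far) < len(canned) else EOS


X = ["I", "am", "drinking", "chai"]   # input sequence, T = 4
Y = generate(X, toy_step)             # output sequence, T' chosen by the loop
print(len(X), len(Y))
```

Here T = 4 and T' = 5, and nothing in `generate` ties the two together. The `max_steps` cap is a safety limit for the sketch, not part of the problem statement.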