Section 04

How it works: encoders, decoders, and backwards input

Sequence to Sequence Learning with Neural Networks 2014

Let’s look under the hood of the encoder-decoder architecture. Here is a simplified step-by-step walkthrough of how an English sentence becomes a French sentence during inference, that is, when the trained model is actually translating for a user.

Step 1: The encoder reads the input

We have an English sentence: “How are you”. We add a special token to mark the end of the sentence: <EOS> (end of sentence).

Each word is first mapped to a vector of numbers (its embedding). The encoder LSTM reads the vector for “How” and updates its hidden state. Carrying that hidden state forward, it reads “are” and updates again. It then reads “you” and updates. Finally it reads <EOS> and updates one last time.

Step 2: The context-vector handoff

The final hidden state of the encoder LSTM is captured. This is the context vector. The English words themselves are now discarded: everything the model knows about the sentence is compressed into this single fixed-size array of numbers, no matter how long the sentence was.
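Steps 1 and 2 can be sketched in a few lines of plain Python. This is only an illustration under loud assumptions: a simple tanh recurrent cell stands in for the LSTM, the weights and embeddings are random rather than trained, and the names `rnn_step`, `encode`, and `make_matrix` are made up for this example.

```python
import math
import random

def make_matrix(rows, cols, rng):
    """Small random weight matrix stored as a list of rows."""
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]

def rnn_step(hidden, x, w_hh, w_xh):
    """One recurrent update: new_hidden = tanh(W_hh @ hidden + W_xh @ x).

    Assumption: a plain tanh cell used as a stand-in for the LSTM cell.
    """
    new_hidden = []
    for i in range(len(hidden)):
        total = sum(w_hh[i][j] * hidden[j] for j in range(len(hidden)))
        total += sum(w_xh[i][k] * x[k] for k in range(len(x)))
        new_hidden.append(math.tanh(total))
    return new_hidden

def encode(tokens, embeddings, w_hh, w_xh, hidden_size):
    """Read the tokens one by one and return the final hidden state."""
    hidden = [0.0] * hidden_size
    for token in tokens:
        hidden = rnn_step(hidden, embeddings[token], w_hh, w_xh)
    return hidden  # this final state plays the role of the context vector

rng = random.Random(0)
HIDDEN_SIZE, EMBED_SIZE = 4, 3  # toy sizes chosen for readability
vocab = ["How", "are", "you", "<EOS>"]
embeddings = {tok: [rng.uniform(-1, 1) for _ in range(EMBED_SIZE)] for tok in vocab}
w_hh = make_matrix(HIDDEN_SIZE, HIDDEN_SIZE, rng)
w_xh = make_matrix(HIDDEN_SIZE, EMBED_SIZE, rng)

context_vector = encode(["How", "are", "you", "<EOS>"], embeddings, w_hh, w_xh, HIDDEN_SIZE)
print(len(context_vector))  # 4: the size is fixed regardless of sentence length
```

Note that `encode` returns only the last state. The intermediate states are thrown away, which is exactly the bottleneck described above.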

Step 3: The decoder starts speaking

We awaken the decoder LSTM. We set its initial hidden state to be the context vector. We feed it a special start-of-sentence <SOS> token to get it going.

Based on the context vector and the <SOS> token, the decoder outputs a probability distribution over the entire French vocabulary. It picks the most likely word — let’s say “Comment”.
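To make “a probability distribution over the vocabulary” concrete, here is a small sketch of the usual softmax step followed by picking the most likely word. The four-word vocabulary and the raw scores are invented for illustration; a real decoder would produce one score per word in a vocabulary of tens of thousands.

```python
import math

def softmax(scores):
    """Turn raw scores into probabilities that sum to 1."""
    peak = max(scores)  # subtracting the max avoids overflow in exp
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

french_vocab = ["Comment", "allez", "vous", "<EOS>"]  # toy vocabulary (assumption)
scores = [2.0, 0.5, 0.1, -1.0]  # made-up decoder scores for the first step

probs = softmax(scores)
best_index = max(range(len(probs)), key=lambda i: probs[i])
print(french_vocab[best_index])  # Comment
```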

Step 4: The feedback loop (greedy decoding)

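The decoder takes the word it just generated (“Comment”) and feeds it back into itself as the input for the next step. It looks at its updated hidden state, sees “Comment”, and predicts “allez”. It feeds “allez” back in, and predicts “vous”. It feeds “vous” back in, and predicts <EOS>. Once it predicts <EOS>, the translation is complete. Always taking the single most likely word is called greedy decoding. It is the simplest strategy; the original paper used beam search, which keeps several candidate translations alive at each step instead of just one.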
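The feedback loop in Steps 3 and 4 might look roughly like this. The function `greedy_decode` is a sketch, and `toy_step` is a hypothetical stand-in for a trained decoder: it looks up the next word in a hard-coded table rather than running an LSTM, and it does not actually update the hidden state.

```python
def greedy_decode(step_fn, context_vector, max_len=10):
    """Generate words one at a time, feeding each prediction back in."""
    hidden = context_vector  # the decoder starts from the encoder's final state
    token = "<SOS>"
    output = []
    for _ in range(max_len):  # safety cap in case <EOS> is never predicted
        probs, hidden = step_fn(hidden, token)
        token = max(probs, key=probs.get)  # greedy: take the most likely word
        if token == "<EOS>":
            break
        output.append(token)
    return output

# Hypothetical stand-in for a trained decoder (assumption, not a real model).
NEXT_WORD = {"<SOS>": "Comment", "Comment": "allez", "allez": "vous", "vous": "<EOS>"}
VOCAB = ["Comment", "allez", "vous", "<EOS>"]

def toy_step(hidden, token):
    """Return (probabilities over VOCAB, hidden state) for one decoder step."""
    likely = NEXT_WORD[token]
    probs = {w: (0.7 if w == likely else 0.1) for w in VOCAB}
    return probs, hidden  # a real LSTM would also return an updated hidden state

print(greedy_decode(toy_step, context_vector=[0.0] * 4))
# ['Comment', 'allez', 'vous']
```

The `max_len` cap is a practical safeguard: an imperfect model may never emit <EOS>, and the loop has to stop somewhere.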

The music teacher: teacher forcing

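The feedback loop described above is how the model works after it is trained. During training, though, the same loop causes trouble. Each prediction is conditioned on the previous word, so if an untrained decoder is fed its own wrong guess early on, every later step starts from a bad input and the rest of the sentence turns into garbage. The model would learn very slowly.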

To fix this, researchers use teacher forcing. Imagine a music teacher sitting with you at a harmonium. You are supposed to play the sequence Sa-Re-Ga-Ma. You play Sa, but then you mess up and play Pa instead of Re. If the teacher lets you continue from Pa, you will play the whole song wrong. Instead, the teacher corrects you: “No, the second note was Re. Now, assuming you played Re, what comes next?”

In teacher forcing, during training, we do not feed the decoder its own predicted word. We feed it the actual correct word from the training data, regardless of what it predicted. This keeps training stable and fast.
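In code, teacher forcing mostly comes down to how the decoder's inputs and targets are lined up. The sketch below is a minimal illustration with made-up helper names (`teacher_forced_pairs`, `cross_entropy`) and invented probabilities; it shows the bookkeeping, not a full training loop.

```python
import math

def teacher_forced_pairs(target_tokens):
    """Pair each decoder input with the word it should predict next."""
    decoder_inputs = ["<SOS>"] + target_tokens    # always the correct previous word
    decoder_targets = target_tokens + ["<EOS>"]   # the same sequence shifted by one
    return list(zip(decoder_inputs, decoder_targets))

def cross_entropy(predicted_probs, correct_word):
    """Loss is large when the correct word received a low probability."""
    return -math.log(predicted_probs[correct_word])

pairs = teacher_forced_pairs(["Comment", "allez", "vous"])
print(pairs)
# [('<SOS>', 'Comment'), ('Comment', 'allez'), ('allez', 'vous'), ('vous', '<EOS>')]

# Suppose the model wrongly favours "vous" at the second step (made-up numbers).
step2_probs = {"Comment": 0.1, "allez": 0.2, "vous": 0.6, "<EOS>": 0.1}
loss = cross_entropy(step2_probs, "allez")  # the mistake is penalised here
next_input = pairs[2][0]                    # but the next input is still "allez"
print(next_input)  # allez
```

The wrong guess still costs the model through the loss, but it never contaminates the input of the following step.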

The weird hack: the reverse-input trick

Here is one of the most famous empirical hacks in deep-learning history. Sutskever and his co-authors, Vinyals and Le, noticed the network was struggling with long sentences. By the time the encoder reached the end of the English sentence, it was having a hard time remembering the beginning.

Their solution? They reversed the English sentence before feeding it to the encoder.

Instead of feeding “A B C”, they fed “C B A”. The decoder still generated the target sequence in its normal order: “X Y Z”.
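As a preprocessing step, the trick is a one-line change. The helper below is a sketch with a hypothetical name (`prepare_pair`); placing <EOS> after the reversed source words is an assumption of this example.

```python
def prepare_pair(source_tokens, target_tokens):
    """Reverse the source only; the target keeps its natural order."""
    encoder_input = source_tokens[::-1] + ["<EOS>"]  # assumption: <EOS> still comes last
    decoder_target = target_tokens + ["<EOS>"]
    return encoder_input, decoder_target

print(prepare_pair(["A", "B", "C"], ["X", "Y", "Z"]))
# (['C', 'B', 'A', '<EOS>'], ['X', 'Y', 'Z', '<EOS>'])
```

Nothing about the network itself changes. Only the order in which the encoder sees the source words is different.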

Why did this work so well?

Think about how Indian addresses are usually written: Name, House Number, Street, City, State, PIN code, going from specific to broad. Some other postal systems, such as Japan's, traditionally write the same details in the reverse order, starting broad and getting specific. The information is identical in both cases; only the order changes, and the order decides which piece sits closest to where it will be used.

Same idea here. If you feed the source sentence in its normal order, word “A” is far away from word “X” in the network’s processing steps. By reversing the input to “C B A”, word “A” is processed right before the context vector is handed off. Word “A” is therefore still fresh in the network’s memory just as the decoder starts translating the beginning of the sentence, word “X”. The later words end up farther from their translations, so the average distance does not change, but the short links at the start of the sentence make the problem easier to learn. In the paper, this simple trick raised the BLEU score (the standard metric for measuring translation quality) from 25.9 to 30.6.
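The distance argument can be checked with a little arithmetic. The sketch below assumes the simple word-for-word pairing used in this section (“A” with “X”, “B” with “Y”, “C” with “Z”) and ignores the <EOS> and <SOS> tokens; `time_lags` is a made-up helper that counts how many processing steps separate each source word from its translation.

```python
def time_lags(source, target, reverse_source):
    """Steps between each source word and its paired target word."""
    src = source[::-1] if reverse_source else source
    sequence = src + target  # the order in which the network sees the words
    return [sequence.index(t) - sequence.index(s) for s, t in zip(source, target)]

source, target = ["A", "B", "C"], ["X", "Y", "Z"]
print(time_lags(source, target, reverse_source=False))  # [3, 3, 3]
print(time_lags(source, target, reverse_source=True))   # [1, 3, 5]
```

Both lists average 3, so reversing does not shorten the distances overall. What changes is the smallest lag: “A” is now only one step away from “X”, which gives the network an easy first connection to learn from.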