Section 05

The math: a toy translation example

Sequence to Sequence Learning with Neural Networks 2014

Let’s strip away the complex LSTM gates for a moment and look at the core math of a basic Sequence-to-Sequence Recurrent Neural Network (RNN). We will do a full, worked numerical example.

If you need a refresher on matrix math, read the matrix multiplication tutorial first.

The equations

The encoder updates its hidden state h_t at each step based on the previous hidden state h_{t-1} and the current input word vector x_t:

$$h_t = \tanh(W_h \, h_{t-1} + W_x \, x_t)$$

The context vector C is simply the final hidden state after T steps:

$$C = h_T$$
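The two encoder equations above can be sketched in a few lines of plain Python. This is a minimal illustration, not a reference implementation: the helper names `matvec`, `encoder_step`, and `encode` are made up for this sketch, and matrices are represented as lists of rows.

```python
import math

def matvec(M, v):
    """Multiply a matrix (list of rows) by a vector (list of numbers)."""
    return [sum(a * b for a, b in zip(row, v)) for row in M]

def encoder_step(W_h, W_x, h_prev, x_t):
    """One update: h_t = tanh(W_h h_{t-1} + W_x x_t)."""
    z = [a + b for a, b in zip(matvec(W_h, h_prev), matvec(W_x, x_t))]
    return [math.tanh(z_i) for z_i in z]

def encode(W_h, W_x, inputs, h0):
    """Run the recurrence over all T inputs and return C = h_T."""
    h = h0
    for x_t in inputs:
        h = encoder_step(W_h, W_x, h, x_t)
    return h
```

Note that `encode` keeps only the latest hidden state. Whatever the model remembers about earlier words has to survive inside that one vector.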

The decoder updates its hidden state s_t based on its previous state and the previous output word. The context vector enters only through the initial state, s_0 = C:

$$s_t = \tanh(W_s \, s_{t-1} + W_y \, y_{t-1})$$

Finally, the decoder produces a probability distribution over the vocabulary using a softmax layer:

$$P(y_t) = \text{softmax}(W_{out} \, s_t)$$

Softmax turns raw scores into probabilities that sum to 1. We’ll cover it in detail in the softmax tutorial coming up before Paper 07; for now, you can use the probability basics tutorial as background.
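As a quick illustration of what softmax does to a list of raw scores, here is a minimal sketch in plain Python. The three scores are made-up values for a hypothetical three-word vocabulary, not numbers from this section.

```python
import math

def softmax(scores):
    """Turn a list of raw scores into probabilities that sum to 1."""
    # Subtracting the max does not change the result but avoids overflow in exp.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Made-up scores, for illustration only.
probs = softmax([2.0, 1.0, 0.1])
print(probs)
print(sum(probs))
```

The largest score always gets the largest probability, and equal scores get equal probabilities.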

The toy example

Let’s translate a two-word Hindi phrase to English.

  • Input (Hindi): “Chai” → “Piyo”
  • Target output (English): “Drink” → “Tea” → <EOS>

We’ll use extremely tiny matrices and vectors so we can calculate everything by hand.

Let our hidden state be a 2-dimensional vector. Initial encoder state:

$$h_0 = \begin{bmatrix} 0 \\ 0 \end{bmatrix}$$

Word vectors (also 2D):

$$x_{\text{chai}} = \begin{bmatrix} 1.0 \\ 0.0 \end{bmatrix}, \quad x_{\text{piyo}} = \begin{bmatrix} 0.0 \\ 1.0 \end{bmatrix}$$

Encoder weight matrices:

$$W_h = \begin{bmatrix} 0.5 & 0.1 \\ 0.2 & 0.5 \end{bmatrix}, \quad W_x = \begin{bmatrix} 0.8 & 0.2 \\ 0.1 & 0.9 \end{bmatrix}$$

Encoder Step 1 (input: “Chai”)

Compute the linear combination. Since h_0 is zero, the W_h h_0 term drops out:

$$z_1 = W_x \, x_{\text{chai}} = \begin{bmatrix} 0.8 & 0.2 \\ 0.1 & 0.9 \end{bmatrix} \begin{bmatrix} 1.0 \\ 0.0 \end{bmatrix} = \begin{bmatrix} 0.8 \\ 0.1 \end{bmatrix}$$

Apply tanh:

$$h_1 = \tanh\!\begin{bmatrix} 0.8 \\ 0.1 \end{bmatrix} \approx \begin{bmatrix} 0.66 \\ 0.10 \end{bmatrix}$$

Encoder Step 2 (input: “Piyo”)

$$W_h \, h_1 = \begin{bmatrix} 0.5 & 0.1 \\ 0.2 & 0.5 \end{bmatrix} \begin{bmatrix} 0.66 \\ 0.10 \end{bmatrix} = \begin{bmatrix} 0.5{\times}0.66 + 0.1{\times}0.10 \\ 0.2{\times}0.66 + 0.5{\times}0.10 \end{bmatrix} = \begin{bmatrix} 0.34 \\ 0.182 \end{bmatrix}$$

$$W_x \, x_{\text{piyo}} = \begin{bmatrix} 0.8 & 0.2 \\ 0.1 & 0.9 \end{bmatrix} \begin{bmatrix} 0.0 \\ 1.0 \end{bmatrix} = \begin{bmatrix} 0.2 \\ 0.9 \end{bmatrix}$$

Add them:

$$z_2 = \begin{bmatrix} 0.34 \\ 0.182 \end{bmatrix} + \begin{bmatrix} 0.2 \\ 0.9 \end{bmatrix} = \begin{bmatrix} 0.54 \\ 1.082 \end{bmatrix}$$

Apply tanh:

$$h_2 = \tanh\!\begin{bmatrix} 0.54 \\ 1.082 \end{bmatrix} \approx \begin{bmatrix} 0.49 \\ 0.79 \end{bmatrix}$$
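If you want to check the hand arithmetic, the two encoder steps can be reproduced with a short plain-Python sketch. The weights and word vectors are copied from the toy example. The helper names `matvec` and `rnn_step` are made up for this sketch. Unlike the hand calculation, the code carries the unrounded h_1 into step 2; the results still agree to two decimal places.

```python
import math

def matvec(M, v):
    """Multiply a matrix (list of rows) by a vector (list of numbers)."""
    return [sum(a * b for a, b in zip(row, v)) for row in M]

def rnn_step(W_h, W_x, h_prev, x):
    """One encoder update: h_t = tanh(W_h h_{t-1} + W_x x_t)."""
    z = [a + b for a, b in zip(matvec(W_h, h_prev), matvec(W_x, x))]
    return [math.tanh(z_i) for z_i in z]

# Values copied from the toy example.
W_h = [[0.5, 0.1], [0.2, 0.5]]
W_x = [[0.8, 0.2], [0.1, 0.9]]
x_chai = [1.0, 0.0]
x_piyo = [0.0, 1.0]
h0 = [0.0, 0.0]

h1 = rnn_step(W_h, W_x, h0, x_chai)   # encoder step 1
h2 = rnn_step(W_h, W_x, h1, x_piyo)   # encoder step 2, using the unrounded h1

print([round(v, 2) for v in h1])  # prints [0.66, 0.1]
print([round(v, 2) for v in h2])  # prints [0.49, 0.79]
```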

The context vector

The encoder has finished reading. Our context vector C is h_2:

$$C = \begin{bmatrix} 0.49 \\ 0.79 \end{bmatrix}$$

All knowledge of the Hindi sentence is now packed into these two numbers. That’s the bottleneck we’ll talk about in Section 8.

Decoder initialisation

We pass this context vector to the decoder. The decoder’s initial state s_0 becomes C: [0.49, 0.79].

From here, the decoder uses s_0 and the <SOS> token to compute s_1, multiplies s_1 by an output vocabulary matrix W_out, and runs softmax to generate probabilities for the English dictionary — hopefully giving the highest probability to the word “Drink”.
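The first decoder step described above can be sketched as follows. Only s_0 and the three target tokens come from the worked example. The `<SOS>` embedding and the matrices `W_s`, `W_y`, and `W_out` are placeholder values invented for this sketch, because the example does not specify them. With untrained weights like these there is no reason for “Drink” to win; training is what pushes its probability up.

```python
import math

def matvec(M, v):
    """Multiply a matrix (list of rows) by a vector (list of numbers)."""
    return [sum(a * b for a, b in zip(row, v)) for row in M]

def softmax(scores):
    """Turn raw scores into probabilities that sum to 1."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# From the worked example: the decoder starts from the context vector.
s0 = [0.49, 0.79]
vocab = ["Drink", "Tea", "<EOS>"]

# Everything below is a hypothetical placeholder, NOT from the worked example.
y_sos = [1.0, 0.0]                               # made-up embedding of <SOS>
W_s = [[0.4, 0.3], [0.1, 0.6]]                   # made-up decoder recurrence weights
W_y = [[0.2, 0.5], [0.7, 0.1]]                   # made-up previous-word weights
W_out = [[1.0, 0.5], [0.2, 0.9], [-0.3, 0.1]]    # made-up 3x2 output matrix

# s_1 = tanh(W_s s_0 + W_y y_0), with y_0 the <SOS> embedding.
z1 = [a + b for a, b in zip(matvec(W_s, s0), matvec(W_y, y_sos))]
s1 = [math.tanh(v) for v in z1]

# P(y_1) = softmax(W_out s_1): one probability per vocabulary word.
probs = softmax(matvec(W_out, s1))
for word, p in zip(vocab, probs):
    print(f"{word}: {p:.3f}")
```

To continue decoding, the chosen word's vector would be fed back in as y_1 to compute s_2, and so on until `<EOS>` is produced.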

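Done. That’s the encoder half of a seq2seq forward pass calculated by hand, with the decoder step outlined. LSTM seq2seq models in 2014 followed the same encode-then-decode recipe, with gated LSTM cells in place of the plain tanh update and far larger vectors and matrices.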