5. The math — a toy translation example
Let’s strip away the complex LSTM gates for a moment and look at the core math of a basic Sequence-to-Sequence Recurrent Neural Network (RNN). We will do a full, worked numerical example.
If you need a refresher on matrix math, see the matrix multiplication tutorial.
The equations
The encoder updates its hidden state $h_t$ at each step based on the previous hidden state $h_{t-1}$ and the current input word vector $x_t$:
$$h_t = \tanh(W_h \, h_{t-1} + W_x \, x_t)$$
The context vector $C$ is simply the final hidden state after $T$ steps:
$$C = h_T$$
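As a rough sketch, the encoder recurrence and the context vector can be written in a few lines of plain Python. The helper names (`matvec`, `encoder_step`, `encode`) are illustrative assumptions, not part of any library; vectors are plain lists and matrices are lists of rows.

```python
import math

def matvec(W, v):
    """Multiply a matrix (list of rows) by a vector (list of floats)."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]

def encoder_step(W_h, W_x, h_prev, x_t):
    """One encoder update: h_t = tanh(W_h h_{t-1} + W_x x_t)."""
    z = [a + b for a, b in zip(matvec(W_h, h_prev), matvec(W_x, x_t))]
    return [math.tanh(v) for v in z]

def encode(W_h, W_x, inputs, h0):
    """Run the encoder over every input vector; the context vector C is the last h_t."""
    h = h0
    for x_t in inputs:
        h = encoder_step(W_h, W_x, h, x_t)
    return h
```

Calling `encode` with the weights and word vectors from the toy example in this section should land on the same context vector as the hand calculation, up to rounding.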
The decoder updates its hidden state $s_t$ based on its previous state and the previous output word. The context vector enters only through the initial state, $s_0 = C$:
$$s_t = \tanh(W_s \, s_{t-1} + W_y \, y_{t-1})$$
Finally, the decoder produces a probability distribution over the vocabulary using a softmax layer:
$$P(y_t) = \text{softmax}(W_{out} \, s_t)$$
Softmax turns raw scores into probabilities that sum to 1. We’ll cover it in detail in the softmax tutorial coming up before Paper 07; for now, you can use the probability basics tutorial as background.
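To make the decoder equations concrete, here is a minimal sketch of softmax and a single decoder step in plain Python. The function names and the argument order are assumptions made for illustration; subtracting the maximum score before exponentiating is a common numerical-stability trick that does not change the result.

```python
import math

def matvec(W, v):
    """Multiply a matrix (list of rows) by a vector (list of floats)."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]

def softmax(scores):
    """Turn raw scores into probabilities that sum to 1."""
    m = max(scores)  # shift by the max so exp() cannot overflow
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def decoder_step(W_s, W_y, W_out, s_prev, y_prev):
    """s_t = tanh(W_s s_{t-1} + W_y y_{t-1}); P(y_t) = softmax(W_out s_t)."""
    z = [a + b for a, b in zip(matvec(W_s, s_prev), matvec(W_y, y_prev))]
    s_t = [math.tanh(v) for v in z]
    return s_t, softmax(matvec(W_out, s_t))

print([round(p, 2) for p in softmax([2.0, 1.0, 0.1])])  # → [0.66, 0.24, 0.1]
```

The largest raw score gets the largest probability, and the three values add up to 1.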
The toy example
Let’s translate a two-word Hindi phrase to English.
- Input (Hindi): “Chai” → “Piyo”
- Target output (English): “Drink” → “Tea” → <EOS>
We’ll use extremely tiny matrices and vectors so we can calculate everything by hand.
Let our hidden state be a 2-dimensional vector. Initial encoder state:
$$h_0 = \begin{bmatrix} 0 \\ 0 \end{bmatrix}$$
Word vectors (also 2D):
$$x_{\text{chai}} = \begin{bmatrix} 1.0 \\ 0.0 \end{bmatrix}, \quad x_{\text{piyo}} = \begin{bmatrix} 0.0 \\ 1.0 \end{bmatrix}$$
Encoder weight matrices:
$$W_h = \begin{bmatrix} 0.5 & 0.1 \\ 0.2 & 0.5 \end{bmatrix}, \quad W_x = \begin{bmatrix} 0.8 & 0.2 \\ 0.1 & 0.9 \end{bmatrix}$$
Encoder Step 1 (input: “Chai”)
Compute the linear combination. Since $h_0$ is zero, the $W_h \, h_0$ term drops out:
$$z_1 = W_x \, x_{\text{chai}} = \begin{bmatrix} 0.8 & 0.2 \\ 0.1 & 0.9 \end{bmatrix} \begin{bmatrix} 1.0 \\ 0.0 \end{bmatrix} = \begin{bmatrix} 0.8 \\ 0.1 \end{bmatrix}$$
Apply tanh:
$$h_1 = \tanh\!\begin{bmatrix} 0.8 \\ 0.1 \end{bmatrix} \approx \begin{bmatrix} 0.66 \\ 0.10 \end{bmatrix}$$
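If you want to check Step 1 without a calculator, a few lines of plain Python reproduce it. This is only a sanity check of the arithmetic above; the variable names mirror the symbols in the text.

```python
import math

W_x = [[0.8, 0.2],
       [0.1, 0.9]]
x_chai = [1.0, 0.0]

# h_0 is the zero vector, so z_1 reduces to W_x x_chai.
z_1 = [sum(w * x for w, x in zip(row, x_chai)) for row in W_x]
h_1 = [math.tanh(v) for v in z_1]

print(z_1)                         # → [0.8, 0.1]
print([round(v, 2) for v in h_1])  # → [0.66, 0.1]
```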
Encoder Step 2 (input: “Piyo”)
$$W_h \, h_1 = \begin{bmatrix} 0.5 & 0.1 \\ 0.2 & 0.5 \end{bmatrix} \begin{bmatrix} 0.66 \\ 0.10 \end{bmatrix} = \begin{bmatrix} 0.5{\times}0.66 + 0.1{\times}0.10 \\ 0.2{\times}0.66 + 0.5{\times}0.10 \end{bmatrix} = \begin{bmatrix} 0.34 \\ 0.182 \end{bmatrix}$$
$$W_x \, x_{\text{piyo}} = \begin{bmatrix} 0.8 & 0.2 \\ 0.1 & 0.9 \end{bmatrix} \begin{bmatrix} 0.0 \\ 1.0 \end{bmatrix} = \begin{bmatrix} 0.2 \\ 0.9 \end{bmatrix}$$
Add them:
$$z_2 = \begin{bmatrix} 0.34 \\ 0.182 \end{bmatrix} + \begin{bmatrix} 0.2 \\ 0.9 \end{bmatrix} = \begin{bmatrix} 0.54 \\ 1.082 \end{bmatrix}$$
Apply tanh:
$$h_2 = \tanh\!\begin{bmatrix} 0.54 \\ 1.082 \end{bmatrix} \approx \begin{bmatrix} 0.49 \\ 0.79 \end{bmatrix}$$
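The hand calculation feeds the rounded $h_1 = [0.66, 0.10]$ into Step 2. The short check below, a sketch in plain Python with an illustrative `step` helper, runs Step 2 once with the rounded $h_1$ and once with the unrounded one, to show that the rounding does not change the answer at two decimal places.

```python
import math

W_h = [[0.5, 0.1], [0.2, 0.5]]
W_x = [[0.8, 0.2], [0.1, 0.9]]
x_piyo = [0.0, 1.0]

def matvec(W, v):
    """Multiply a matrix (list of rows) by a vector (list of floats)."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]

def step(h_prev, x_t):
    """h_t = tanh(W_h h_{t-1} + W_x x_t) with the toy weights above."""
    z = [a + b for a, b in zip(matvec(W_h, h_prev), matvec(W_x, x_t))]
    return [math.tanh(v) for v in z]

h_1_rounded = [0.66, 0.10]                    # value used in the hand calculation
h_1_exact = [math.tanh(0.8), math.tanh(0.1)]  # unrounded Step 1 result

h_2_from_rounded = step(h_1_rounded, x_piyo)
h_2_from_exact = step(h_1_exact, x_piyo)

print([round(v, 2) for v in h_2_from_rounded])  # → [0.49, 0.79]
print([round(v, 2) for v in h_2_from_exact])    # → [0.49, 0.79]
```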
The context vector
The encoder has finished reading. Our context vector $C$ is $h_2$:
$$C = \begin{bmatrix} 0.49 \\ 0.79 \end{bmatrix}$$
All knowledge of the Hindi sentence is now packed into these two numbers. That’s the bottleneck we’ll talk about in Section 8.
Decoder initialisation
We pass this context vector to the decoder. The decoder’s initial state $s_0$ becomes $C$: $[0.49, 0.79]$.
From here, the decoder uses $s_0$ and the <SOS> token to compute $s_1$, multiplies $s_1$ by an output vocabulary matrix $W_{out}$, and runs softmax to generate probabilities over the English vocabulary, hopefully giving the highest probability to the word “Drink”.
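The walkthrough does not give numbers for the decoder, so the sketch below fills them in with made-up placeholders purely to show the mechanics of the first decoder step. The `<SOS>` embedding `y_sos`, the matrices `W_s`, `W_y`, and `W_out`, and the three-word vocabulary ordering are all assumptions; they are untrained, so the printed distribution says nothing about what a real model would output.

```python
import math

def matvec(W, v):
    """Multiply a matrix (list of rows) by a vector (list of floats)."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]

def softmax(scores):
    """Turn raw scores into probabilities that sum to 1."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

vocab = ["Drink", "Tea", "<EOS>"]

s_0 = [0.49, 0.79]  # context vector C from the encoder

# Everything below is a hypothetical placeholder: the text does not
# specify decoder weights or an embedding for <SOS>.
y_sos = [1.0, 0.0]
W_s = [[0.6, 0.1], [0.3, 0.4]]
W_y = [[0.2, 0.5], [0.7, 0.1]]
W_out = [[1.0, 0.5], [0.2, 0.9], [-0.5, -0.3]]  # one row per vocabulary word

# s_1 = tanh(W_s s_0 + W_y y_0), then P(y_1) = softmax(W_out s_1)
z_1 = [a + b for a, b in zip(matvec(W_s, s_0), matvec(W_y, y_sos))]
s_1 = [math.tanh(v) for v in z_1]
probs = softmax(matvec(W_out, s_1))

for word, p in zip(vocab, probs):
    print(f"{word}: {p:.3f}")
```

In a trained model, the word with the highest probability would be emitted and fed back in as $y_1$ for the next step, repeating until <EOS> is produced.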
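Done. That’s the encoder’s forward pass of a seq2seq model, calculated by hand, plus the hand-off to the decoder. The LSTM seq2seq models of 2014 followed this same flow, with gated cells in place of the plain tanh update and much larger vectors and matrices.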