5. The math — equations and a worked numeric example
Everything in this section uses only the math we have already built. If any of it feels unfamiliar, open the relevant tutorial alongside this page:
- Vectors, introduction — for dot products and element-wise multiplication.
- The chain rule — for how the gradient flows backward.
- Gradient intuition — for what “the gradient vanishes” actually means.
5.1 Notation
We’ll use bold lowercase for vectors (x, h, c) and bold uppercase for matrices (W, U).
| Symbol | Meaning |
|---|---|
| xₜ | input vector at step t |
| hₜ | hidden state vector at step t |
| cₜ | cell state vector at step t |
| W_·, U_·, b_· | weight matrices and bias vectors for each gate |
| σ(·) | sigmoid function, maps any real number to (0, 1) |
| tanh(·) | hyperbolic tangent, maps any real number to (−1, +1) |
| ⊙ | element-wise (Hadamard) product of two vectors of equal size |
The subscripts f, i, c, o on weights refer to the forget, input, candidate, and output parts of the cell.
5.2 The six LSTM equations
For hidden size n and input size m, the standard LSTM cell computes:
fₜ = σ(W_f xₜ + U_f hₜ₋₁ + b_f) ... (1) forget gate
iₜ = σ(W_i xₜ + U_i hₜ₋₁ + b_i) ... (2) input gate
c̃ₜ = tanh(W_c xₜ + U_c hₜ₋₁ + b_c) ... (3) candidate
cₜ = fₜ ⊙ cₜ₋₁ + iₜ ⊙ c̃ₜ ... (4) cell update
oₜ = σ(W_o xₜ + U_o hₜ₋₁ + b_o) ... (5) output gate
hₜ = oₜ ⊙ tanh(cₜ) ... (6) hidden state
Notice: equations (1), (2), (3), (5) all have the same shape. Each is a
small affine transformation of the input and previous hidden state,
followed by a nonlinearity. You can think of them as four parallel
views of the same joint input [hₜ₋₁, xₜ].
In real implementations, people stack all four weight matrices into one big matrix and do a single matrix multiplication — then split the result into four chunks and apply the appropriate nonlinearity to each. The math is unchanged; only the engineering is cleaner.
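To make the stacking trick concrete, here is a minimal sketch in plain Python. The helper names (`matvec`, `gates_stacked`), the row order f, i, c, o, and the numeric values are assumptions chosen for illustration; real libraries pick their own gate order and layout.

```python
import math

def sigmoid(z):
    """Logistic function, maps any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def matvec(A, v):
    """Multiply a matrix (list of rows) by a vector (list of floats)."""
    return [sum(a * x for a, x in zip(row, v)) for row in A]

def gates_stacked(S, b, h_prev, x):
    """Compute f, i, c_tilde, o with a single matrix multiplication.

    S is the stacked matrix with 4n rows and (n + m) columns. Each row is
    [U_row, W_row], so it acts on the joint input [h_prev, x].
    Row blocks are ordered f, i, c, o (an assumed convention).
    """
    n = len(h_prev)
    joint = h_prev + x  # list concatenation: [h_prev, x]
    z = [s + bias for s, bias in zip(matvec(S, joint), b)]
    f = [sigmoid(v) for v in z[0:n]]
    i = [sigmoid(v) for v in z[n:2 * n]]
    c_tilde = [math.tanh(v) for v in z[2 * n:3 * n]]
    o = [sigmoid(v) for v in z[3 * n:4 * n]]
    return f, i, c_tilde, o

# Illustrative values with n = 1, m = 1; each row is [U, W] for one gate.
S = [[0.5, 0.7],   # forget
     [0.3, 0.4],   # input
     [0.6, 0.8],   # candidate
     [0.2, 0.5]]   # output
b = [0.1, 0.0, 0.0, 0.1]
f, i, c_tilde, o = gates_stacked(S, b, h_prev=[0.5], x=[1.0])

# The forget gate computed on its own, exactly as equation (1) is written.
f_separate = sigmoid(0.7 * 1.0 + 0.5 * 0.5 + 0.1)
print(f[0], f_separate)
```

Both printed numbers should agree, which is the point: one multiplication followed by four slices gives the same gates as four separate affine maps.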
5.3 Counting parameters
For input size m and hidden size n, each of the four gates has:
- Weights: one n × m matrix (W) + one n × n matrix (U) = n(m + n) numbers.
- Bias: an n-vector = n numbers.
Four gates total:
total params = 4 × ( n(m+n) + n )
= 4n(m+n) + 4n
= 4n(m + n + 1)
For a realistic small LSTM with m = 100, n = 256:
4 × 256 × (100 + 256 + 1) = 4 × 256 × 357 = 365,568 parameters
Large, but manageable. This is why LSTMs were practical even on 1990s hardware — they were not deep, just unrolled in time.
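The count above is easy to check in code. This is a small sketch; the function name `lstm_param_count` is ours, and it assumes exactly one bias vector per gate, as in the derivation.

```python
def lstm_param_count(m: int, n: int) -> int:
    """Parameters of one LSTM cell with input size m and hidden size n.

    Each of the four gates has W (n x m), U (n x n), and one bias (n).
    """
    per_gate = n * m + n * n + n
    return 4 * per_gate

count = lstm_param_count(100, 256)
print(count)  # 365568
```

Some implementations keep two bias vectors per gate instead of one. That would add another 4n numbers, so a library may report a slightly larger total than this formula.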
5.4 A fully worked numeric example
We will now run one LSTM step with tiny, hand-computable numbers. Everything here can be verified with a calculator (or by hand, if you are patient).
Setup
- Input size m = 1 (just a scalar input).
- Hidden size n = 1 (a single hidden unit).
- All the W, U, b are scalars.
Let the parameters be:
W_f = 0.7, U_f = 0.5, b_f = 0.1
W_i = 0.4, U_i = 0.3, b_i = 0.0
W_c = 0.8, U_c = 0.6, b_c = 0.0
W_o = 0.5, U_o = 0.2, b_o = 0.1
Starting values:
xₜ = 1.0, hₜ₋₁ = 0.5, cₜ₋₁ = 0.8
Step 1 — forget gate
z_f = W_f · xₜ + U_f · hₜ₋₁ + b_f
= 0.7·1.0 + 0.5·0.5 + 0.1
= 0.7 + 0.25 + 0.1
= 1.05
fₜ = σ(1.05) = 1 / (1 + e⁻¹·⁰⁵)
= 1 / (1 + 0.3499)
= 1 / 1.3499
≈ 0.741
The forget gate fires at about 0.74 — it will keep 74% of the old memory.
Step 2 — input gate
z_i = 0.4·1.0 + 0.3·0.5 + 0.0
= 0.4 + 0.15
= 0.55
iₜ = σ(0.55) ≈ 0.634
About 63% of whatever candidate we produce will be written to memory.
Step 3 — candidate
z_c = 0.8·1.0 + 0.6·0.5 + 0.0
= 0.8 + 0.3
= 1.1
c̃ₜ = tanh(1.1) ≈ 0.800
Step 4 — cell state update
cₜ = fₜ · cₜ₋₁ + iₜ · c̃ₜ
= 0.741 · 0.8 + 0.634 · 0.800
= 0.5928 + 0.5072
= 1.100
The new cell state is 1.100 — up from 0.8. Some old memory was kept, some new content was added.
Step 5 — output gate
z_o = 0.5·1.0 + 0.2·0.5 + 0.1
= 0.5 + 0.1 + 0.1
= 0.7
oₜ = σ(0.7) ≈ 0.668
Step 6 — hidden state
tanh(cₜ) = tanh(1.100) ≈ 0.800
hₜ = oₜ · tanh(cₜ)
= 0.668 · 0.800
≈ 0.534
The new hidden state is about 0.534.
Summary of this step
Inputs: xₜ = 1.0, hₜ₋₁ = 0.5, cₜ₋₁ = 0.8
Gates: fₜ ≈ 0.741, iₜ ≈ 0.634, oₜ ≈ 0.668
Memory: cₜ ≈ 1.100 (was 0.8)
Output: hₜ ≈ 0.534 (was 0.5)
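Every gate activation, the candidate c̃ₜ, and the hidden state hₜ lie strictly between −1 and +1. The exceptions are the raw sums z_f = 1.05 and z_c = 1.1, and the cell state cₜ = 1.100. The cell state is free to drift outside (−1, +1) because that is how it carries “pressure” of accumulated evidence. That drift is intentional.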
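If you would rather check the arithmetic by machine than by calculator, the whole step fits in a few lines of plain Python. This is a sketch for the scalar case only (m = n = 1); the function name `lstm_step_scalar` and the dictionary layout are our own choices, not part of any library.

```python
import math

def sigmoid(z):
    """Logistic function, maps any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def lstm_step_scalar(x, h_prev, c_prev, p):
    """One LSTM step with scalar input and a single hidden unit."""
    f = sigmoid(p["W_f"] * x + p["U_f"] * h_prev + p["b_f"])          # (1)
    i = sigmoid(p["W_i"] * x + p["U_i"] * h_prev + p["b_i"])          # (2)
    c_tilde = math.tanh(p["W_c"] * x + p["U_c"] * h_prev + p["b_c"])  # (3)
    c = f * c_prev + i * c_tilde                                      # (4)
    o = sigmoid(p["W_o"] * x + p["U_o"] * h_prev + p["b_o"])          # (5)
    h = o * math.tanh(c)                                              # (6)
    return {"f": f, "i": i, "c_tilde": c_tilde, "c": c, "o": o, "h": h}

params = {
    "W_f": 0.7, "U_f": 0.5, "b_f": 0.1,
    "W_i": 0.4, "U_i": 0.3, "b_i": 0.0,
    "W_c": 0.8, "U_c": 0.6, "b_c": 0.0,
    "W_o": 0.5, "U_o": 0.2, "b_o": 0.1,
}

out = lstm_step_scalar(x=1.0, h_prev=0.5, c_prev=0.8, p=params)
for name, value in out.items():
    print(f"{name} = {value:.4f}")
```

The code carries full floating-point precision through every step, while the hand calculation rounds each intermediate to three decimals. Expect agreement to about three decimal places, with the last digit of hₜ possibly off by one.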
5.5 Why the gradient survives — the key line
Look again at equation (4):
cₜ = fₜ ⊙ cₜ₋₁ + iₜ ⊙ c̃ₜ
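Differentiate with respect to cₜ₋₁, treating the gate values fₜ, iₜ, c̃ₜ as fixed. This is the direct path through the cell state; the gates also depend on cₜ₋₁ indirectly through hₜ₋₁, but that route is the ordinary RNN path and is not what makes the LSTM special.
∂cₜ / ∂cₜ₋₁ = fₜ
That is the whole thing along this path. No matrix multiply. No tanh derivative. Just the forget gate vector, applied element-wise.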
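A quick finite-difference check makes this derivative tangible. The sketch below reuses the rounded gate values from the worked example and holds them fixed, so it measures only the direct path through the cell state. The function name `cell_update` is ours.

```python
def cell_update(c_prev, f, i, c_tilde):
    """Equation (4) for a single memory slot."""
    return f * c_prev + i * c_tilde

# Gate values from the worked example, held fixed (an assumption of this check).
f, i, c_tilde = 0.741, 0.634, 0.800
c_prev = 0.8
eps = 1e-6

# Central difference approximation of d c_t / d c_{t-1}.
numeric = (cell_update(c_prev + eps, f, i, c_tilde)
           - cell_update(c_prev - eps, f, i, c_tilde)) / (2 * eps)

print(f"numeric derivative = {numeric:.6f}")
print(f"forget gate        = {f:.6f}")
```

The two printed values should match to many decimal places, because equation (4) is linear in cₜ₋₁ once the gates are fixed.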
Now unroll backward through many steps. The gradient from step T back to step 1 picks up the product:
∏ (from t=2 to T) fₜ
If the forget gates stay near 1 — say around 0.95 — for a relevant memory
slot, this product over 100 steps is 0.95¹⁰⁰ ≈ 0.006. Tiny, but nonzero.
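Compare this to a plain RNN, where each step multiplies by the recurrent
weights times a tanh derivative. If that factor has magnitude around 0.1,
the product over 100 steps is 0.1¹⁰⁰ = 10⁻¹⁰⁰, which is utterly gone.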
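The two products are worth computing once to feel the difference in scale. This sketch uses a constant per-step factor for simplicity; in a trained network the factors vary from step to step and from slot to slot. The function name `product_of_factors` is ours.

```python
import math

def product_of_factors(factors):
    """Multiply the per-step factors a gradient picks up on its way back."""
    return math.prod(factors)

steps = 100
lstm_factor = product_of_factors([0.95] * steps)  # forget gate near 1
rnn_factor = product_of_factors([0.1] * steps)    # illustrative plain-RNN factor

print(f"forget gate 0.95 for {steps} steps: {lstm_factor:.4g}")
print(f"factor 0.1 for {steps} steps:       {rnn_factor:.4g}")
print(f"ratio: {lstm_factor / rnn_factor:.3g}")
```

The first product lands near 0.006 and the second near 10⁻¹⁰⁰, a gap of roughly 97 orders of magnitude.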
The LSTM does not eliminate gradient decay. It controls it. The network can learn to keep the forget gate near 1 for information it wants to preserve, and the gradient for that slot travels backward almost undistorted. Meanwhile, slots it wants to overwrite can have low forget gates, which correctly kills their gradient contribution.
This is the full justification for the architecture. Every other design choice — the input gate, the candidate, the output gate — is there to make this fundamental additive highway useful. Without them, the memory would just accumulate garbage. With them, it becomes a controllable store that the network itself learns to manage.
5.6 What the paper proved, briefly
The original 1997 paper includes a theorem and long proof that the “constant error carousel” (what we just saw as the additive cell state, with the forget gate held at 1) can preserve error signals over arbitrarily long time lags. The 1997 cell had no forget gate; that gate was added in later work, so the original design corresponds to fₜ = 1 in equation (4). The paper also shows empirical results on toy tasks where plain RNNs completely fail and LSTMs succeed, including a task with a time lag of over 1000 steps.
You do not need to read the proof. You have already seen its heart: equation (4) and its partial derivative. That one line is the core of the paper's contribution.