5. The math — equations and a worked numeric example

Everything in this section uses only the math we have already built. If any of it feels unfamiliar, open the relevant tutorial alongside this page:

Vectors, introduction — for dot products and element-wise multiplication.
The chain rule — for how the gradient flows backward.
Gradient intuition — for what “the gradient vanishes” actually means.

5.1 Notation

We’ll use bold lowercase for vectors (x, h, c) and bold uppercase for matrices (W, U).

Symbol	Meaning
xₜ	input vector at step t
hₜ	hidden state vector at step t
cₜ	cell state vector at step t
W_·, U_·, b_·	weight matrices and bias vectors for each gate
σ(·)	sigmoid function, maps any real number to (0, 1)
tanh(·)	hyperbolic tangent, maps any real number to (−1, +1)
⊙	element-wise (Hadamard) product of two vectors of equal size

The subscripts f, i, c, o on the weights and biases refer to the forget, input, candidate, and output parts of the cell.
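As a minimal sketch in plain Python (standard library only), the three operations from the table can be written as follows. The helper names sigmoid and hadamard are illustrative choices for this page, not names from any library, and vectors are represented as ordinary lists.

```python
import math

def sigmoid(z):
    """Logistic sigmoid: maps any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def hadamard(a, b):
    """Element-wise product of two equal-length vectors (lists)."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    return [ai * bi for ai, bi in zip(a, b)]

print(sigmoid(0.0))                                 # 0.5, the midpoint of (0, 1)
print(math.tanh(0.0))                               # 0.0, the midpoint of (-1, +1)
print(hadamard([1.0, 2.0, 3.0], [0.5, 0.0, 2.0]))   # [0.5, 0.0, 6.0]
```

Note that the Hadamard product never mixes components: slot k of the result depends only on slot k of each input. That property is what lets each memory slot be gated independently later on.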

5.2 The six LSTM equations

For hidden size n and input size m, the standard LSTM cell computes:

fₜ  = σ(W_f xₜ + U_f hₜ₋₁ + b_f)                 ... (1) forget gate
iₜ  = σ(W_i xₜ + U_i hₜ₋₁ + b_i)                 ... (2) input gate
c̃ₜ = tanh(W_c xₜ + U_c hₜ₋₁ + b_c)               ... (3) candidate
cₜ  = fₜ ⊙ cₜ₋₁ + iₜ ⊙ c̃ₜ                       ... (4) cell update
oₜ  = σ(W_o xₜ + U_o hₜ₋₁ + b_o)                 ... (5) output gate
hₜ  = oₜ ⊙ tanh(cₜ)                              ... (6) hidden state

Notice: equations (1), (2), (3), (5) all have the same shape. Each is a small affine transformation of the input and previous hidden state, followed by a nonlinearity. You can think of them as four parallel views of the same joint input [hₜ₋₁, xₜ].

In real implementations, the W and U matrices of all four gates are usually stacked into one big matrix, so that a single matrix multiplication on the joint input produces all four pre-activations at once. The result is then split into four chunks and the appropriate nonlinearity is applied to each. The math is unchanged; only the engineering is cleaner.
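The stacking trick can be checked with a small sketch. The sizes (m = 2, n = 3), the random weights, and the helper names matvec and rand_matrix below are all assumptions made for illustration. The joint input is ordered as [xₜ, hₜ₋₁] here so that it lines up with the stacked rows [W | U]; any consistent ordering works.

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def matvec(M, v):
    """Matrix-vector product for a matrix stored as a list of rows."""
    return [sum(m_ij * v_j for m_ij, v_j in zip(row, v)) for row in M]

def rand_matrix(rows, cols):
    return [[random.uniform(-1, 1) for _ in range(cols)] for _ in range(rows)]

random.seed(0)
m, n = 2, 3                       # illustrative input and hidden sizes
gates = ["f", "i", "c", "o"]

W = {g: rand_matrix(n, m) for g in gates}   # n x m, acts on x_t
U = {g: rand_matrix(n, n) for g in gates}   # n x n, acts on h_{t-1}
b = {g: [random.uniform(-1, 1) for _ in range(n)] for g in gates}

x = [0.5, -1.0]
h_prev = [0.1, 0.2, -0.3]

# Separate form: one affine map per gate, as in equations (1), (2), (3), (5).
z_separate = {
    g: [wx + uh + bg
        for wx, uh, bg in zip(matvec(W[g], x), matvec(U[g], h_prev), b[g])]
    for g in gates
}

# Fused form: stack the rows [W_g | U_g] of all gates into one (4n) x (m+n) matrix.
M = [W[g][r] + U[g][r] for g in gates for r in range(n)]
b_all = [b[g][r] for g in gates for r in range(n)]
xh = x + h_prev                   # joint input, ordered to match [W | U]
z_all = [zr + br for zr, br in zip(matvec(M, xh), b_all)]

# Split into four chunks of length n and apply each part's nonlinearity.
chunks = {g: z_all[k * n:(k + 1) * n] for k, g in enumerate(gates)}
f = [sigmoid(z) for z in chunks["f"]]
i = [sigmoid(z) for z in chunks["i"]]
c_tilde = [math.tanh(z) for z in chunks["c"]]
o = [sigmoid(z) for z in chunks["o"]]

max_diff = max(abs(s - t) for g in gates
               for s, t in zip(z_separate[g], chunks[g]))
print(max_diff)   # zero up to floating-point rounding
```

The two forms add the same terms in a different order, so they can differ by rounding error but nothing more.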

5.3 Counting parameters

For input size m and hidden size n, each of the four gates has:

Weights: one n × m matrix (W) + one n × n matrix (U) = n(m + n) numbers.
Bias: an n-vector = n numbers.

Four gates total:

total params = 4 × ( n(m+n) + n )
             = 4n(m+n) + 4n
             = 4n(m + n + 1)

For a realistic small LSTM with m = 100, n = 256:

4 × 256 × (100 + 256 + 1) = 4 × 256 × 357 = 365,568 parameters

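Large, but manageable. The count depends only on m and n, not on the sequence length, because the same weights are reused at every time step. This is part of why LSTMs were practical even on 1990s hardware: they were not deep, just unrolled in time.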
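The formula is easy to turn into a small helper. The function name lstm_param_count and the biases_per_gate argument are illustrative assumptions. The second argument is there because some libraries (PyTorch's nn.LSTM, for example) store two bias vectors per gate, which adds 4n to the total.

```python
def lstm_param_count(m, n, biases_per_gate=1):
    """Parameters in one LSTM layer with input size m and hidden size n.

    Each of the four gates has an n x m matrix W, an n x n matrix U,
    and `biases_per_gate` bias vectors of length n.
    """
    per_gate = n * m + n * n + biases_per_gate * n
    return 4 * per_gate

print(lstm_param_count(100, 256))                      # 365568
print(lstm_param_count(100, 256, biases_per_gate=2))   # 366592
```

If a framework reports a slightly larger number than 4n(m + n + 1) for the same sizes, the extra bias vector is a likely explanation.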

5.4 A fully worked numeric example

We will now run one LSTM step with tiny, hand-computable numbers. Everything here can be verified with a calculator (or by hand, if you are patient).

Setup

Input size m = 1 (just a scalar input).
Hidden size n = 1 (a single hidden unit).
All the W, U, b are scalars.

Let the parameters be:

W_f = 0.7,  U_f = 0.5,  b_f = 0.1
W_i = 0.4,  U_i = 0.3,  b_i = 0.0
W_c = 0.8,  U_c = 0.6,  b_c = 0.0
W_o = 0.5,  U_o = 0.2,  b_o = 0.1

Starting values:

xₜ = 1.0     hₜ₋₁ = 0.5     cₜ₋₁ = 0.8

Step 1 — forget gate

z_f = W_f · xₜ + U_f · hₜ₋₁ + b_f
    = 0.7·1.0 + 0.5·0.5 + 0.1
    = 0.7 + 0.25 + 0.1
    = 1.05

fₜ = σ(1.05) = 1 / (1 + e⁻¹·⁰⁵)
   = 1 / (1 + 0.3499)
   = 1 / 1.3499
   ≈ 0.741

The forget gate fires at about 0.74 — it will keep 74% of the old memory.

Step 2 — input gate

z_i = 0.4·1.0 + 0.3·0.5 + 0.0
    = 0.4 + 0.15
    = 0.55

iₜ = σ(0.55) ≈ 0.634

About 63% of whatever candidate we produce will be written to memory.

Step 3 — candidate

z_c = 0.8·1.0 + 0.6·0.5 + 0.0
    = 0.8 + 0.3
    = 1.1

c̃ₜ = tanh(1.1) ≈ 0.800

Step 4 — cell state update

cₜ = fₜ · cₜ₋₁ + iₜ · c̃ₜ
   = 0.741 · 0.8 + 0.634 · 0.800
   = 0.5928 + 0.5072
   = 1.100

The new cell state is 1.100 — up from 0.8. Some old memory was kept, some new content was added.

Step 5 — output gate

z_o = 0.5·1.0 + 0.2·0.5 + 0.1
    = 0.5 + 0.1 + 0.1
    = 0.7

oₜ = σ(0.7) ≈ 0.668

Step 6 — hidden state

tanh(cₜ) = tanh(1.100) ≈ 0.800
hₜ = oₜ · tanh(cₜ)
   = 0.668 · 0.800
   ≈ 0.534

The new hidden state is about 0.534.

Summary of this step

Inputs:   xₜ = 1.0,  hₜ₋₁ = 0.5,  cₜ₋₁ = 0.8
Gates:    fₜ ≈ 0.741,  iₜ ≈ 0.634,  oₜ ≈ 0.668
Memory:   cₜ ≈ 1.100  (was 0.8)
Output:   hₜ ≈ 0.534  (was 0.5)
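If you would rather check the six steps by machine than by calculator, the sketch below reruns them in plain Python. The dictionaries W, U, b and the variable names are just one way to lay out the numbers from the setup; nothing here is library code.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Parameters and starting values copied from the setup above.
W = {"f": 0.7, "i": 0.4, "c": 0.8, "o": 0.5}
U = {"f": 0.5, "i": 0.3, "c": 0.6, "o": 0.2}
b = {"f": 0.1, "i": 0.0, "c": 0.0, "o": 0.1}
x_t, h_prev, c_prev = 1.0, 0.5, 0.8

# Pre-activations z_g = W_g * x_t + U_g * h_prev + b_g for each part of the cell.
z = {g: W[g] * x_t + U[g] * h_prev + b[g] for g in W}

f_t = sigmoid(z["f"])                 # step 1, forget gate
i_t = sigmoid(z["i"])                 # step 2, input gate
c_tilde = math.tanh(z["c"])           # step 3, candidate
c_t = f_t * c_prev + i_t * c_tilde    # step 4, cell update
o_t = sigmoid(z["o"])                 # step 5, output gate
h_t = o_t * math.tanh(c_t)            # step 6, hidden state

for name, value in [("f_t", f_t), ("i_t", i_t), ("c_tilde", c_tilde),
                    ("c_t", c_t), ("o_t", o_t), ("h_t", h_t)]:
    print(f"{name:8s} = {value:.4f}")
```

The script carries full floating-point precision between steps, while the hand calculation rounds every intermediate value to three decimals. The results therefore agree to about two decimal places, and the last digit of hₜ can differ (roughly 0.535 here rather than 0.534).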

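Every gate activation (fₜ, iₜ, oₜ), the candidate c̃ₜ, and the hidden state hₜ lie between −1 and +1. The exceptions are the raw sums z_f = 1.05 and z_c = 1.1, which are squashed before use, and the cell state cₜ ≈ 1.100, which is not. The cell state is free to drift outside (−1, +1) because that is how it carries “pressure” of accumulated evidence. That drift is intentional.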
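To see the drift over more than one step, the sketch below keeps feeding the same cell. It assumes the input stays at 1.0 for ten steps in a row, which is an illustrative choice and not part of the worked example; the function name step is likewise hypothetical.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def step(x_t, h_prev, c_prev):
    """One scalar LSTM step with the parameters from the worked example."""
    f_t = sigmoid(0.7 * x_t + 0.5 * h_prev + 0.1)
    i_t = sigmoid(0.4 * x_t + 0.3 * h_prev + 0.0)
    c_tilde = math.tanh(0.8 * x_t + 0.6 * h_prev + 0.0)
    o_t = sigmoid(0.5 * x_t + 0.2 * h_prev + 0.1)
    c_t = f_t * c_prev + i_t * c_tilde
    h_t = o_t * math.tanh(c_t)
    return h_t, c_t

h, c = 0.5, 0.8                   # starting values from the worked example
history = []
for t in range(1, 11):
    h, c = step(1.0, h, c)        # assumption: the input is 1.0 at every step
    history.append((h, c))
    print(f"step {t:2d}: c = {c:.3f}, h = {h:.3f}")
```

The cell state climbs well past 1 as evidence accumulates, while the hidden state stays inside (−1, +1) because it is always an output gate value times a tanh.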

5.5 Why the gradient survives — the key line

Look again at equation (4):

cₜ = fₜ ⊙ cₜ₋₁ + iₜ ⊙ c̃ₜ

Differentiate with respect to cₜ₋₁ along the direct path through the cell state, treating the gate values at step t as given:

∂cₜ / ∂cₜ₋₁ = fₜ

That is the whole thing. No matrix multiply. No tanh derivative. Just the forget gate vector.
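A finite-difference check makes the same point numerically. The sketch reuses the scalar parameters from section 5.4 and nudges cₜ₋₁ while holding xₜ and hₜ₋₁ fixed, so it measures only the direct path through the cell state. The function name cell_update and the step size eps are illustrative assumptions.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def cell_update(c_prev, x_t=1.0, h_prev=0.5):
    """Return (c_t, f_t) for the scalar worked example of section 5.4."""
    f_t = sigmoid(0.7 * x_t + 0.5 * h_prev + 0.1)
    i_t = sigmoid(0.4 * x_t + 0.3 * h_prev + 0.0)
    c_tilde = math.tanh(0.8 * x_t + 0.6 * h_prev + 0.0)
    return f_t * c_prev + i_t * c_tilde, f_t

eps = 1e-6
c_plus, f_t = cell_update(0.8 + eps)
c_minus, _ = cell_update(0.8 - eps)
numeric_grad = (c_plus - c_minus) / (2 * eps)   # central finite difference

print(f"finite difference: {numeric_grad:.6f}")
print(f"forget gate f_t:   {f_t:.6f}")
```

Because cₜ is linear in cₜ₋₁ once the gates are fixed, the two printed numbers agree up to rounding error.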

Now unroll backward through many steps. The gradient from step T back to step 1 picks up the product:

∏ (from t=2 to T) fₜ

If the forget gates stay near 1 (say around 0.95) for a relevant memory slot, this product over 100 steps is 0.95¹⁰⁰ ≈ 0.006. Tiny, but nonzero. Compare this to a plain RNN whose per-step factor (recurrent weights times the tanh derivative) is, for illustration, 0.1: over 100 steps that gives 0.1¹⁰⁰ = 10⁻¹⁰⁰, utterly gone.
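The arithmetic is quick to reproduce. The sketch assumes, purely for illustration, that the per-step factor is the same at every step; in a trained network the forget gate varies from step to step and from slot to slot. The function name gradient_factor is hypothetical.

```python
def gradient_factor(per_step, steps):
    """Product of `steps` identical per-step factors."""
    return per_step ** steps

# Assumption: a constant forget gate value held for 100 steps.
for f in (0.5, 0.9, 0.95, 0.99):
    factor = gradient_factor(f, 100)
    print(f"forget gate {f:.2f}: factor after 100 steps = {factor:.3e}")

# Illustrative plain-RNN factor of 0.1 per step, as in the text.
print(f"plain RNN at 0.1 per step: {gradient_factor(0.1, 100):.3e}")
```

Small changes in the gate value matter a great deal over long spans: moving from 0.95 to 0.99 raises the surviving factor from under one percent to over a third.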

The LSTM does not eliminate gradient decay. It controls it. The network can learn to keep the forget gate near 1 for information it wants to preserve, and the gradient for that slot travels backward almost undistorted. Meanwhile, slots it wants to overwrite can have low forget gates, which correctly kills their gradient contribution.

This is the full justification for the architecture. Every other design choice — the input gate, the candidate, the output gate — is there to make this fundamental additive highway useful. Without them, the memory would just accumulate garbage. With them, it becomes a controllable store that the network itself learns to manage.

5.6 What the paper proved, briefly

The original 1997 paper includes a lengthy derivation showing that the “constant error carousel” can preserve error signals over arbitrarily long time lags. In that paper the cell's self-connection has a fixed weight of 1; the learnable forget gate was added in later work, so the additive cell state with forget gate near 1 that we just saw is the modern form of the same idea. The paper also shows empirical results on toy tasks where plain RNNs completely fail and LSTMs succeed, including a task with a time lag of over 1000 steps.

You do not need to read the derivation. You have already seen its heart: equation (4) and its partial derivative. That one line is the core of the paper's contribution.

Next: the code — a minimal LSTM cell in PyTorch.