How It Works — Step by Step
We will trace backpropagation through the simplest possible network with a hidden layer: 2 inputs → 1 hidden neuron → 1 output neuron.
Use this diagram as reference:
x₁ ──w₁──┐
         ├──► h ──w₃──► ŷ ──► Loss
x₂ ──w₂──┘
Where h is the hidden neuron and ŷ is the output.
Numbers we will use:
- Inputs: x₁ = 1, x₂ = 2
- Weights (initial): w₁ = 0.5, w₂ = 0.3, w₃ = 0.8
- Correct answer: y = 1
- Learning rate: η = 0.1
- Activation function: sigmoid σ(z) = 1/(1+e^(-z)), which squashes any real number into the open interval (0, 1). Both the hidden neuron and the output neuron use it.
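The setup above can be written down as a few lines of Python. This is only a sketch: the function name `sigmoid` and the variable names are illustrative choices, not part of the original walkthrough.

```python
import math

def sigmoid(z: float) -> float:
    """Logistic sigmoid: maps any real z into the open interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

# Values taken from the list above (variable names are assumptions).
x1, x2 = 1.0, 2.0            # inputs
w1, w2, w3 = 0.5, 0.3, 0.8   # initial weights
y = 1.0                      # correct answer
eta = 0.1                    # learning rate

print(sigmoid(0.0))   # 0.5, the midpoint of the sigmoid
```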
Step 1: Forward pass — compute the hidden layer output
z_h = w₁×x₁ + w₂×x₂
= 0.5×1 + 0.3×2
= 0.5 + 0.6 = 1.1
h = σ(z_h) = 1 / (1 + e^(-1.1))
= 1 / (1 + 0.3329)
= 1 / 1.3329
≈ 0.750
The hidden neuron takes inputs 1 and 2, computes their weighted sum (1.1), then squashes it through the sigmoid to get 0.750.
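Step 1 can be reproduced in a few lines of Python, as a minimal sketch. The names `z_h` and `h` mirror the symbols in the text; `sigmoid` is a helper defined here for illustration.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

x1, x2 = 1.0, 2.0
w1, w2 = 0.5, 0.3

z_h = w1 * x1 + w2 * x2   # weighted sum of the inputs
h = sigmoid(z_h)          # hidden activation

print(f"z_h = {z_h:.3f}, h = {h:.3f}")   # z_h = 1.100, h = 0.750
```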
Step 2: Forward pass — compute the output
z_out = w₃ × h
= 0.8 × 0.750 = 0.600
ŷ = σ(z_out) = 1 / (1 + e^(-0.600))
= 1 / (1 + 0.5488)
≈ 0.646
The network predicts 0.646. The correct answer is 1. Time to compute the error.
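Steps 1 and 2 together form the whole forward pass. One way to sketch it is a single function; `forward` is a hypothetical helper name, and it keeps full floating-point precision rather than the hand-rounded intermediate values.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def forward(x1, x2, w1, w2, w3):
    """Return (h, y_hat) for the 2-input, 1-hidden, 1-output network."""
    h = sigmoid(w1 * x1 + w2 * x2)   # Step 1: hidden neuron
    y_hat = sigmoid(w3 * h)          # Step 2: output neuron
    return h, y_hat

h, y_hat = forward(1.0, 2.0, 0.5, 0.3, 0.8)
print(f"h = {h:.3f}, y_hat = {y_hat:.3f}")   # h = 0.750, y_hat = 0.646
```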
Step 3: Compute the loss
Using squared error:
L = (y - ŷ)² = (1 - 0.646)² = (0.354)² ≈ 0.125
The loss is 0.125. Now the backward pass begins.
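The loss calculation is a one-liner in Python. In this sketch, `squared_error` is an illustrative name, and the input 0.646 is the rounded prediction from Step 2, so the result matches the hand calculation.

```python
def squared_error(y: float, y_hat: float) -> float:
    """Squared error for a single example: (y - y_hat)^2."""
    return (y - y_hat) ** 2

# 0.646 is the rounded prediction from Step 2.
loss = squared_error(1.0, 0.646)
print(f"{loss:.3f}")   # 0.125
```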
Step 4: Backward pass — gradient at the output
How much does the loss change when we change ŷ?
∂L/∂ŷ = -2(y - ŷ) = -2(1 - 0.646) = -2 × 0.354 = -0.708
How much does ŷ change when we change z_out? Since ŷ = σ(z_out), this is the sigmoid's derivative, σ′(z) = σ(z) × (1 - σ(z)):
∂ŷ/∂z_out = ŷ × (1 - ŷ) = 0.646 × (1 - 0.646) = 0.646 × 0.354 ≈ 0.229
Combining (chain rule):
∂L/∂z_out = ∂L/∂ŷ × ∂ŷ/∂z_out = -0.708 × 0.229 ≈ -0.162
This is called the delta (δ) of the output neuron — its error signal.
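The two local derivatives and their product can be checked numerically. This sketch starts from the rounded prediction 0.646; the variable names are illustrative.

```python
y, y_hat = 1.0, 0.646   # target and rounded prediction from Step 2

dL_dyhat = -2.0 * (y - y_hat)         # derivative of (y - y_hat)^2 w.r.t. y_hat
dyhat_dzout = y_hat * (1.0 - y_hat)   # sigmoid derivative, written via its output
delta_out = dL_dyhat * dyhat_dzout    # chain rule: dL/dz_out

print(f"{dL_dyhat:.3f} {dyhat_dzout:.3f} {delta_out:.3f}")   # -0.708 0.229 -0.162
```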
Step 5: Backward pass — gradient for w₃
How much does z_out change when we change w₃? Since z_out = w₃ × h:
∂z_out/∂w₃ = h = 0.750
So:
∂L/∂w₃ = ∂L/∂z_out × ∂z_out/∂w₃ = -0.162 × 0.750 ≈ -0.122
Update w₃ using the gradient descent rule w_new = w - η × ∂L/∂w:
w₃_new = 0.8 - 0.1 × (-0.122) = 0.8 + 0.0122 ≈ 0.812
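The gradient and update for w₃ in Python, as a sketch using the rounded values from the earlier steps. The unrounded product is -0.1215, which the text reports as ≈ -0.122.

```python
delta_out = -0.162   # dL/dz_out from Step 4 (rounded)
h = 0.750            # hidden activation from Step 1 (rounded)
w3, eta = 0.8, 0.1

dL_dw3 = delta_out * h       # dz_out/dw3 = h
w3_new = w3 - eta * dL_dw3   # gradient descent step

print(f"{dL_dw3:.4f} {w3_new:.3f}")   # -0.1215 0.812
```

Because the gradient is negative, subtracting it increases w₃.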
Step 6: Backward pass — propagate error to the hidden layer
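How much does z_out change when h changes? Since z_out = w₃ × h (using the original w₃ = 0.8 from the forward pass, not the updated 0.812, because every gradient in a training step is evaluated at the pre-update weights):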
∂z_out/∂h = w₃ = 0.8
How much does h change when z_h changes? (Sigmoid derivative again):
∂h/∂z_h = h × (1 - h) = 0.750 × 0.250 = 0.1875
Chain rule all the way back to z_h:
∂L/∂z_h = ∂L/∂z_out × ∂z_out/∂h × ∂h/∂z_h
= -0.162 × 0.8 × 0.1875
≈ -0.0243
This is the delta of the hidden neuron — the blame assigned to it.
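The hidden neuron's delta can be computed the same way. This sketch uses the rounded values from earlier steps and the forward-pass value of w₃; the variable names are illustrative.

```python
delta_out = -0.162   # dL/dz_out from Step 4 (rounded)
w3 = 0.8             # weight used in the forward pass, not the updated value
h = 0.750            # hidden activation from Step 1 (rounded)

dzout_dh = w3            # because z_out = w3 * h
dh_dzh = h * (1.0 - h)   # sigmoid derivative at the hidden neuron
delta_h = delta_out * dzout_dh * dh_dzh   # chain rule back to z_h

print(f"{dh_dzh:.4f} {delta_h:.4f}")   # 0.1875 -0.0243
```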
Step 7: Update w₁ and w₂
∂L/∂w₁ = ∂L/∂z_h × ∂z_h/∂w₁ = -0.0243 × x₁ = -0.0243 × 1 = -0.0243
∂L/∂w₂ = ∂L/∂z_h × ∂z_h/∂w₂ = -0.0243 × x₂ = -0.0243 × 2 = -0.0486
w₁_new = 0.5 - 0.1 × (-0.0243) ≈ 0.502
w₂_new = 0.3 - 0.1 × (-0.0486) ≈ 0.305
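Both hidden-layer updates follow the same pattern: multiply the hidden delta by the corresponding input, then take a gradient descent step. A sketch with the rounded delta from Step 6 (the list names are illustrative):

```python
delta_h = -0.0243    # dL/dz_h from Step 6 (rounded)
x = [1.0, 2.0]       # inputs x1, x2
w = [0.5, 0.3]       # weights w1, w2
eta = 0.1

grads = [delta_h * xi for xi in x]                    # dz_h/dw_i = x_i
w_new = [wi - eta * gi for wi, gi in zip(w, grads)]   # gradient descent step

print([round(g, 4) for g in grads])   # [-0.0243, -0.0486]
print([round(v, 3) for v in w_new])   # [0.502, 0.305]
```

The gradient for w₂ is twice that for w₁ only because x₂ is twice x₁: a larger input gives its weight more influence on z_h.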
Step 8: One training step is complete
All three weights have been updated:
- w₁: 0.500 → 0.502
- w₂: 0.300 → 0.305
- w₃: 0.800 → 0.812
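The changes are small, but each one moves in the direction that lowers the loss: all three gradients were negative, so all three weights increased, which pushes ŷ up toward the target of 1. Run this forward-backward cycle thousands of times on thousands of examples, and the network learns.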
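To see the repeated cycle in action, the eight steps can be folded into one function and run in a loop. This is a simplified sketch, not a realistic training setup: `train_step` is a hypothetical name, and the loop reuses the single example from this walkthrough instead of a dataset of thousands.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train_step(w, x, y, eta):
    """One forward-backward pass; returns (updated weights, loss before the update)."""
    w1, w2, w3 = w
    x1, x2 = x
    # Forward pass (Steps 1-3)
    h = sigmoid(w1 * x1 + w2 * x2)
    y_hat = sigmoid(w3 * h)
    loss = (y - y_hat) ** 2
    # Backward pass (Steps 4-7); all gradients use the pre-update weights
    delta_out = -2.0 * (y - y_hat) * y_hat * (1.0 - y_hat)
    delta_h = delta_out * w3 * h * (1.0 - h)
    grads = (delta_h * x1, delta_h * x2, delta_out * h)
    new_w = tuple(wi - eta * gi for wi, gi in zip(w, grads))
    return new_w, loss

w = (0.5, 0.3, 0.8)
losses = []
for _ in range(1000):
    w, loss = train_step(w, (1.0, 2.0), 1.0, 0.1)
    losses.append(loss)

print(f"first loss {losses[0]:.3f}, last loss {losses[-1]:.3f}")
```

The first recorded loss may differ from 0.125 in the last digit, because the code carries full precision while the hand calculation rounds ŷ to 0.646 before squaring.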
The key achievement: we updated w₁ and w₂ — the hidden layer weights — using information that flowed backwards from the output loss. The credit assignment problem is solved.
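A common way to sanity-check hand-derived gradients is to compare them with finite-difference estimates: nudge one weight slightly, rerun the forward pass, and measure how the loss changes. The sketch below does this for all three weights; `loss_fn`, `numerical_grad`, and the step size `eps` are illustrative choices, not part of the original walkthrough.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def loss_fn(w, x=(1.0, 2.0), y=1.0):
    """Forward pass plus squared error for weights w = [w1, w2, w3]."""
    h = sigmoid(w[0] * x[0] + w[1] * x[1])
    return (y - sigmoid(w[2] * h)) ** 2

def numerical_grad(w, eps=1e-6):
    """Central-difference estimate of dL/dw_i for each weight."""
    grads = []
    for i in range(len(w)):
        up, down = list(w), list(w)
        up[i] += eps
        down[i] -= eps
        grads.append((loss_fn(up) - loss_fn(down)) / (2 * eps))
    return grads

g = numerical_grad([0.5, 0.3, 0.8])
print([round(v, 4) for v in g])   # [-0.0243, -0.0486, -0.1216]
```

The estimates agree with the backpropagated gradients for w₁, w₂, and w₃ up to rounding. Finite differences need extra forward passes for every weight, whereas backpropagation obtains all gradients from a single backward pass.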