How It Works — Step by Step

We will trace backpropagation through the simplest possible network with a hidden layer: 2 inputs → 1 hidden neuron → 1 output neuron. To keep the arithmetic short, the network has no bias terms.

Use this diagram as reference:

x₁ ──w₁──┐
           ├──► h ──w₃──► ŷ ──► Loss
x₂ ──w₂──┘

Here h is the output of the hidden neuron and ŷ is the network's prediction.

Numbers we will use:

Inputs: x₁ = 1, x₂ = 2
Weights (initial): w₁ = 0.5, w₂ = 0.3, w₃ = 0.8
Correct answer: y = 1
Learning rate: η = 0.1
Activation function: sigmoid σ(z) = 1/(1+e^(-z)), which squashes any real number into the open interval (0, 1). Both the hidden neuron and the output neuron use it.
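
As a minimal sketch, the setup above can be written in Python. The function name sigmoid and the variable names are our own choices for illustration; the values are the ones listed above.

```python
import math

def sigmoid(z: float) -> float:
    """Logistic sigmoid: 1 / (1 + e^(-z)), always strictly between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))

# Values from the list above (variable names are illustrative).
x1, x2 = 1.0, 2.0            # inputs
w1, w2, w3 = 0.5, 0.3, 0.8   # initial weights
y = 1.0                      # correct answer
eta = 0.1                    # learning rate

# Large negative inputs map near 0, large positive inputs near 1.
for z in (-5.0, 0.0, 5.0):
    print(f"sigmoid({z:+.1f}) = {sigmoid(z):.4f}")
```

Note that sigmoid(0) is exactly 0.5, the midpoint of the output range.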

Step 1: Forward pass — compute the hidden layer output

z_h = w₁×x₁ + w₂×x₂
    = 0.5×1 + 0.3×2
    = 0.5 + 0.6 = 1.1

h = σ(z_h) = 1 / (1 + e^(-1.1))
           = 1 / (1 + 0.3329)
           = 1 / 1.3329
           ≈ 0.750

The hidden neuron takes inputs 1 and 2, computes their weighted sum (1.1), then squashes it through the sigmoid to get 0.750.
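
The same computation as a short Python sketch (variable names are our own, chosen to mirror the symbols above):

```python
import math

x1, x2 = 1.0, 2.0   # inputs
w1, w2 = 0.5, 0.3   # hidden-layer weights

z_h = w1 * x1 + w2 * x2            # weighted sum of the inputs
h = 1.0 / (1.0 + math.exp(-z_h))   # sigmoid of the weighted sum

print(f"z_h = {z_h:.3f}, h = {h:.3f}")  # z_h = 1.100, h = 0.750
```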

Step 2: Forward pass — compute the output

z_out = w₃ × h
      = 0.8 × 0.750 = 0.600

ŷ = σ(z_out) = 1 / (1 + e^(-0.600))
             = 1 / (1 + 0.5488)
             ≈ 0.646

The network predicts 0.646. The correct answer is 1. Time to compute the error.
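
Steps 1 and 2 together form the forward pass. One possible way to package them is a single function; the name forward and its signature are hypothetical, picked only for this illustration.

```python
import math

def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))

def forward(x1, x2, w1, w2, w3):
    """Forward pass of the 2-input, 1-hidden, 1-output network (no biases).

    Returns the hidden activation h and the prediction y_hat.
    """
    h = sigmoid(w1 * x1 + w2 * x2)   # Step 1: hidden neuron
    y_hat = sigmoid(w3 * h)          # Step 2: output neuron
    return h, y_hat

h, y_hat = forward(1.0, 2.0, 0.5, 0.3, 0.8)
print(f"h = {h:.3f}, y_hat = {y_hat:.3f}")  # h = 0.750, y_hat = 0.646
```

Returning h alongside y_hat is a deliberate choice: the backward pass needs the hidden activation again, so it is worth keeping.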

Step 3: Compute the loss

Using squared error:

L = (y - ŷ)² = (1 - 0.646)² = (0.354)² ≈ 0.125

The loss is 0.125. Now the backward pass begins.
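
In code, the loss is a one-liner. This sketch plugs in the rounded prediction 0.646 from Step 2; the helper name squared_error is our own.

```python
def squared_error(y: float, y_hat: float) -> float:
    """Squared difference between the target and the prediction."""
    return (y - y_hat) ** 2

loss = squared_error(1.0, 0.646)
print(round(loss, 3))  # 0.125
```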

Step 4: Backward pass — gradient at the output

How much does the loss change when we change ŷ?

∂L/∂ŷ = -2(y - ŷ) = -2(1 - 0.646) = -2 × 0.354 = -0.708

How much does ŷ change when we change z_out? This is the derivative of the sigmoid, which has a convenient closed form:

σ′(z) = σ(z) × (1 - σ(z))

∂ŷ/∂z_out = ŷ × (1 - ŷ) = 0.646 × (1 - 0.646) = 0.646 × 0.354 ≈ 0.229

Combining (chain rule):

∂L/∂z_out = ∂L/∂ŷ × ∂ŷ/∂z_out = -0.708 × 0.229 ≈ -0.162

This is called the delta (δ) of the output neuron — its error signal.
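
A common way to sanity-check a hand-derived gradient is to compare it with a finite-difference estimate. The sketch below does that for the output delta, using z_out = 0.6 from Step 2. The finite-difference check is our addition, not part of the walkthrough, and the names are illustrative.

```python
import math

def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))

y = 1.0
z_out = 0.6
y_hat = sigmoid(z_out)

# Analytic gradient via the chain rule, as in Step 4.
dL_dyhat = -2.0 * (y - y_hat)
dyhat_dz = y_hat * (1.0 - y_hat)      # sigmoid derivative
delta_out = dL_dyhat * dyhat_dz

# Numerical estimate: nudge z_out slightly in both directions.
def loss_at(z: float) -> float:
    return (y - sigmoid(z)) ** 2

eps = 1e-6
numeric = (loss_at(z_out + eps) - loss_at(z_out - eps)) / (2.0 * eps)

print(f"analytic = {delta_out:.3f}, numeric = {numeric:.3f}")
```

Both values should agree to several decimal places; a mismatch usually points to a sign error or a missing factor in the derivation.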

Step 5: Backward pass — gradient for w₃

How much does z_out change when we change w₃? Since z_out = w₃ × h:

∂z_out/∂w₃ = h = 0.750

So:

∂L/∂w₃ = ∂L/∂z_out × ∂z_out/∂w₃ = -0.162 × 0.750 ≈ -0.122

Update w₃:

w₃_new = 0.8 - 0.1 × (-0.122) = 0.8 + 0.0122 ≈ 0.812
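
The gradient and update for w₃ can be sketched as follows, using the rounded values from the text as inputs (variable names are our own):

```python
delta_out = -0.162   # output delta from Step 4
h = 0.750            # hidden activation from Step 1
w3 = 0.8             # current weight
eta = 0.1            # learning rate

grad_w3 = delta_out * h        # dL/dw3 = delta_out * h
w3_new = w3 - eta * grad_w3    # gradient descent: step against the gradient

print(f"grad_w3 = {grad_w3:.4f}, w3_new = {w3_new:.3f}")
```

Because the gradient is negative, subtracting it increases w₃.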

Step 6: Backward pass — propagate error to the hidden layer

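How much does z_out change when h changes? Since z_out = w₃ × h, the answer is w₃. Use the original w₃ = 0.8 from the forward pass, not the updated value from Step 5: every gradient in a training step is evaluated at the same set of weights.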

∂z_out/∂h = w₃ = 0.8

How much does h change when z_h changes? (Sigmoid derivative again):

∂h/∂z_h = h × (1 - h) = 0.750 × 0.250 = 0.1875

Chain rule all the way back to z_h:

∂L/∂z_h = ∂L/∂z_out × ∂z_out/∂h × ∂h/∂z_h
         = -0.162 × 0.8 × 0.1875
         ≈ -0.0243

This is the delta of the hidden neuron — the blame assigned to it.
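
The hidden delta is just a product of three local derivatives. A minimal sketch with the rounded values from the text (names are illustrative):

```python
delta_out = -0.162   # output delta from Step 4
w3 = 0.8             # weight used in the forward pass (pre-update value)
h = 0.750            # hidden activation from Step 1

dz_out_dh = w3               # how z_out responds to h
dh_dz_h = h * (1.0 - h)      # sigmoid derivative at the hidden neuron

delta_h = delta_out * dz_out_dh * dh_dz_h
print(f"delta_h = {delta_h:.4f}")  # delta_h = -0.0243
```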

Step 7: Update w₁ and w₂

∂L/∂w₁ = ∂L/∂z_h × ∂z_h/∂w₁ = -0.0243 × x₁ = -0.0243 × 1 = -0.0243
∂L/∂w₂ = ∂L/∂z_h × ∂z_h/∂w₂ = -0.0243 × x₂ = -0.0243 × 2 = -0.0486

w₁_new = 0.5 - 0.1 × (-0.0243) ≈ 0.502
w₂_new = 0.3 - 0.1 × (-0.0486) ≈ 0.305
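
Both hidden-layer weights follow the same pattern: gradient = hidden delta × the input on that connection. A short sketch, again with the rounded values from the text and list names of our own choosing:

```python
delta_h = -0.0243      # hidden delta from Step 6
eta = 0.1              # learning rate
inputs = [1.0, 2.0]    # x1, x2
weights = [0.5, 0.3]   # w1, w2

# dL/dw_i = delta_h * x_i
grads = [delta_h * x for x in inputs]

# Gradient descent update for each weight.
new_weights = [w - eta * g for w, g in zip(weights, grads)]

print([round(w, 3) for w in new_weights])  # [0.502, 0.305]
```

The gradient for w₂ is twice that for w₁ only because x₂ is twice x₁; the delta is shared.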

Step 8: One training step is complete

All three weights have been updated:

w₁: 0.500 → 0.502
w₂: 0.300 → 0.305
w₃: 0.800 → 0.812

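The changes are small, but they point in the right direction: all three gradients were negative, so all three weights increased, which pushes ŷ up toward the target of 1. Run this forward-backward cycle thousands of times on thousands of examples, and the network learns.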
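
To see the cycle repeat, here is a sketch that wraps Steps 1 to 7 in one function and runs it many times. It trains on the single example from this walkthrough only, so it illustrates the loop rather than real training; train_step and its signature are hypothetical names for this example.

```python
import math

def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))

def train_step(w, x, y, eta):
    """One forward-backward cycle. Returns (updated weights, loss before update)."""
    w1, w2, w3 = w
    x1, x2 = x
    # Forward pass (Steps 1-3).
    h = sigmoid(w1 * x1 + w2 * x2)
    y_hat = sigmoid(w3 * h)
    loss = (y - y_hat) ** 2
    # Backward pass (Steps 4-6). delta_h uses the pre-update w3.
    delta_out = -2.0 * (y - y_hat) * y_hat * (1.0 - y_hat)
    delta_h = delta_out * w3 * h * (1.0 - h)
    grads = (delta_h * x1, delta_h * x2, delta_out * h)
    # Update all weights together (Steps 5 and 7).
    new_w = tuple(wi - eta * gi for wi, gi in zip(w, grads))
    return new_w, loss

w = (0.5, 0.3, 0.8)
losses = []
for step in range(1000):
    w, loss = train_step(w, (1.0, 2.0), 1.0, 0.1)
    losses.append(loss)

print(f"loss at step 1: {losses[0]:.4f}")
print(f"loss at step 1000: {losses[-1]:.4f}")
```

At full precision the first loss comes out near 0.1255 rather than 0.125, because the hand calculation rounds ŷ to 0.646 before squaring. The last loss should be smaller than the first.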

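The key achievement: we updated w₁ and w₂, the hidden-layer weights, using information that flowed backward from the output loss. This is how backpropagation solves the credit assignment problem: each weight's gradient measures how much that weight contributed to the error.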
