The Code
Backpropagation from scratch in NumPy
This implements a complete neural network (forward pass, loss, backward pass, weight updates) in about 30 lines of code. No PyTorch, no TensorFlow. Every line maps directly to the mathematics in Section 5. To keep the listing short, the network has weights only and no bias terms.
We train it to learn XOR — the problem that defeated the single-layer Perceptron.
# What this code does: Implements backpropagation from scratch for a 2-layer network
# Paper: Learning Representations by Back-propagating Errors (1986)
# Run free at: https://colab.research.google.com/
import numpy as np
# XOR training data (the problem the Perceptron could NOT solve)
X = np.array([[0,0], [0,1], [1,0], [1,1]]) # 4 examples, 2 inputs each
y = np.array([[0], [1], [1], [0]]) # XOR outputs
# Initialise weights randomly (small numbers to start near zero)
np.random.seed(42)
W1 = np.random.randn(2, 4) * 0.5 # input→hidden: 2 inputs, 4 hidden neurons
W2 = np.random.randn(4, 1) * 0.5 # hidden→output: 4 hidden, 1 output
lr = 0.5 # learning rate
def sigmoid(z):
    return 1 / (1 + np.exp(-z)) # squash any number to (0,1)
def sigmoid_deriv(z):
    s = sigmoid(z)
    return s * (1 - s) # sigmoid's own derivative: σ(z)×(1-σ(z))
# Training loop: steps 0 to 10,000 inclusive, so that step 10000 is printed
for step in range(10001):
    # ── FORWARD PASS ──────────────────────────────────────────
    z1 = X @ W1 # (4,2)×(2,4) = (4,4): weighted sum, hidden layer
    a1 = sigmoid(z1) # apply activation to hidden layer
    z2 = a1 @ W2 # (4,4)×(4,1) = (4,1): weighted sum, output layer
    yhat = sigmoid(z2) # final prediction (a value between 0 and 1)
    # ── LOSS ──────────────────────────────────────────────────
    loss = np.mean((y - yhat) ** 2) # mean squared error across all 4 examples
    # ── BACKWARD PASS ─────────────────────────────────────────
    # Output layer delta: how wrong is the output × how sensitive is sigmoid there.
    # This is the gradient of 0.5 × summed squared error; it differs from the
    # gradient of the printed MSE only by a constant factor, which lr absorbs.
    d_out = -(y - yhat) * sigmoid_deriv(z2) # shape (4,1)
    # Gradient for W2: hidden activations × output delta
    dW2 = a1.T @ d_out # shape (4,1)
    # Propagate error back to hidden layer (chain rule through W2 and sigmoid)
    d_hid = (d_out @ W2.T) * sigmoid_deriv(z1) # shape (4,4)
    # Gradient for W1: inputs × hidden delta
    dW1 = X.T @ d_hid # shape (2,4)
    # ── WEIGHT UPDATE (gradient descent) ──────────────────────
    W2 -= lr * dW2
    W1 -= lr * dW1
    if step % 2000 == 0:
        print(f"Step {step:5d} | Loss: {loss:.4f}")
# Test the trained network
print("\nTrained predictions vs correct XOR:")
for i in range(4):
    pred = sigmoid(sigmoid(X[i] @ W1) @ W2)[0]
    print(f" Input {X[i]} → Predicted: {pred:.3f} | Correct: {y[i][0]}")
When you run this, you should see output along these lines (exact values may differ slightly between environments):
Step     0 | Loss: 0.2641
Step  2000 | Loss: 0.1823
Step  4000 | Loss: 0.0312
Step  6000 | Loss: 0.0089
Step  8000 | Loss: 0.0041
Step 10000 | Loss: 0.0024
Trained predictions vs correct XOR:
Input [0 0] → Predicted: 0.045 | Correct: 0
Input [0 1] → Predicted: 0.961 | Correct: 1
Input [1 0] → Predicted: 0.962 | Correct: 1
Input [1 1] → Predicted: 0.048 | Correct: 0
The network has learned XOR — the pattern the single-layer Perceptron could never learn. Predictions near 0 for (0,0) and (1,1), near 1 for (0,1) and (1,0). The loss goes from 0.26 to 0.002 — a 99% reduction.
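One way to convince yourself that the backward pass is correct is a numerical gradient check. The sketch below is an illustration, not part of the original listing: the helper names (forward, half_sse, analytic_grads, numeric_grad) are invented here, and it uses a fresh random initialisation rather than the trained weights. It compares the backward-pass formulas against central finite differences. The formulas are the exact gradient of 0.5 × the summed squared error; with 4 examples, that is 2× the gradient of the printed mean squared error, a constant factor that only rescales the learning rate.

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def forward(X, W1, W2):
    # Same forward pass as the listing, without bias terms.
    a1 = sigmoid(X @ W1)
    yhat = sigmoid(a1 @ W2)
    return a1, yhat

def half_sse(X, y, W1, W2):
    # 0.5 * sum of squared errors: the loss whose gradient the
    # backward-pass formulas compute exactly.
    _, yhat = forward(X, W1, W2)
    return 0.5 * np.sum((y - yhat) ** 2)

def analytic_grads(X, y, W1, W2):
    # Backward pass, using sigma'(z) = a * (1 - a) on the stored activations.
    a1, yhat = forward(X, W1, W2)
    d_out = -(y - yhat) * yhat * (1 - yhat)
    dW2 = a1.T @ d_out
    d_hid = (d_out @ W2.T) * a1 * (1 - a1)
    dW1 = X.T @ d_hid
    return dW1, dW2

def numeric_grad(loss_fn, W, eps=1e-6):
    # Central finite differences, perturbing one weight at a time in place.
    grad = np.zeros_like(W)
    for idx in np.ndindex(*W.shape):
        original = W[idx]
        W[idx] = original + eps
        loss_plus = loss_fn()
        W[idx] = original - eps
        loss_minus = loss_fn()
        W[idx] = original  # restore the weight
        grad[idx] = (loss_plus - loss_minus) / (2 * eps)
    return grad

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([[0], [1], [1], [0]], dtype=float)

rng = np.random.default_rng(0)  # arbitrary seed, chosen for this sketch
W1 = rng.normal(size=(2, 4)) * 0.5
W2 = rng.normal(size=(4, 1)) * 0.5

dW1, dW2 = analytic_grads(X, y, W1, W2)
num_dW1 = numeric_grad(lambda: half_sse(X, y, W1, W2), W1)
num_dW2 = numeric_grad(lambda: half_sse(X, y, W1, W2), W2)

max_err = max(np.abs(dW1 - num_dW1).max(), np.abs(dW2 - num_dW2).max())
print(f"Largest difference between analytic and numeric gradients: {max_err:.2e}")
```

If the formulas are right, the printed difference should be many orders of magnitude smaller than the gradients themselves. A large difference usually points to a sign error or a missing transpose.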
What to change to experiment:
- Run without the hidden layer. Change W1 to shape (2,1) and remove the hidden layer computation: compute yhat = sigmoid(X @ W1) directly and update W1 with X.T @ d_out. Watch it fail to converge on XOR. The single-layer version cannot solve it, just as Minsky and Papert showed.
- Try fewer hidden neurons. Change 4 to 2 in both weight matrices: W1 = np.random.randn(2, 2) * 0.5 and W2 = np.random.randn(2, 1) * 0.5. The network can still learn XOR, because 2 hidden neurons are sufficient in principle, although with so few neurons convergence depends more on the random initialisation. Try 1 hidden neuron: does it still work?
- Watch the hidden layer learn representations. After training, print sigmoid(X @ W1), the hidden layer activations for each input. You will see that the 4 inputs [0,0], [0,1], [1,0], [1,1] have been mapped to 4 distinct patterns in hidden space, representations that make XOR easy to solve.
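The three experiments above can be run side by side with a small harness. This is a sketch under assumptions rather than the original code: the function names train_single_layer and train_two_layer are hypothetical, and the seed, learning rate, and step count are simply copied from the listing. Like the listing, it has no bias terms.

```python
import numpy as np

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y = np.array([[0], [1], [1], [0]])

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def train_single_layer(steps=10000, lr=0.5, seed=42):
    # Experiment 1: no hidden layer, one weight per input, no bias.
    rng = np.random.RandomState(seed)
    W = rng.randn(2, 1) * 0.5
    for _ in range(steps):
        yhat = sigmoid(X @ W)
        delta = -(y - yhat) * yhat * (1 - yhat)
        W -= lr * (X.T @ delta)
    yhat = sigmoid(X @ W)
    return np.mean((y - yhat) ** 2), yhat

def train_two_layer(hidden, steps=10000, lr=0.5, seed=42):
    # Experiment 2: same network as the listing, with a configurable hidden size.
    rng = np.random.RandomState(seed)
    W1 = rng.randn(2, hidden) * 0.5
    W2 = rng.randn(hidden, 1) * 0.5
    for _ in range(steps):
        a1 = sigmoid(X @ W1)
        yhat = sigmoid(a1 @ W2)
        d_out = -(y - yhat) * yhat * (1 - yhat)
        d_hid = (d_out @ W2.T) * a1 * (1 - a1)  # uses W2 before it is updated
        W2 -= lr * (a1.T @ d_out)
        W1 -= lr * (X.T @ d_hid)
    a1 = sigmoid(X @ W1)
    yhat = sigmoid(a1 @ W2)
    return np.mean((y - yhat) ** 2), a1, yhat

single_loss, single_pred = train_single_layer()
print(f"no hidden layer | loss {single_loss:.4f} | preds {single_pred.ravel().round(3)}")

for hidden in (1, 2, 4):
    loss, _, yhat = train_two_layer(hidden)
    print(f"{hidden} hidden neuron(s) | loss {loss:.4f} | preds {yhat.ravel().round(3)}")

# Experiment 3: hidden activations for each of the 4 inputs (one row per input).
_, hidden_acts, _ = train_two_layer(4)
print("Hidden activations after training:")
print(hidden_acts.round(2))
```

Two things are guaranteed by the missing bias terms, whatever the weights turn out to be. First, the single-layer prediction for input [0,0] is always sigmoid(0) = 0.5, and its mean squared error cannot drop below 0.125, because at least one of the other three inputs must land on the wrong side of 0.5. Second, the first row of the hidden activations (input [0,0]) is always 0.5 in every column. How well the 1- and 2-neuron networks do depends on the initial weights, so it is worth trying a few seeds.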