Learning Representations by Back-propagating Errors

David Rumelhart, Geoffrey Hinton, Ronald Williams · 1986 · Nature

“We describe a new learning procedure, back-propagation, for networks of neurone-like units.” — Opening of the paper

The Perceptron ended in a crisis.

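Minsky and Papert had shown in 1969 that single-layer networks were fundamentally limited: they cannot represent functions that are not linearly separable, such as XOR. Multi-layer networks could overcome those limitations, but nobody knew how to train them. The weights in the hidden layers seemed unreachable, because no target value says what a hidden unit should have output. Credit assignment looked intractable. The first AI winter set in.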

For seventeen years, this problem sat unsolved.

Then in 1986, David Rumelhart, Geoffrey Hinton, and Ronald Williams published a four-page paper in Nature, one of the most prestigious scientific journals in the world, showing that the solution had been in plain sight all along. It was the chain rule of calculus, applied backwards through the network.

They called it backpropagation. It helped end the first AI winter and started the modern era of neural networks.

Nearly every neural network trained today, whether a GPT, an image classifier, or a speech recogniser, is trained using this algorithm or a direct descendant of it.

What is in this paper?

Section	What you will learn
Historical Context	The AI winter, why hidden layers were stuck, who solved it
The Problem	Credit assignment — how do you blame a hidden neuron for a mistake?
The Core Idea	Propagate the error backwards through the network using the chain rule
How It Works	Forward pass → compute loss → backward pass → update weights, step by step
The Mathematics	Derivatives, chain rule, gradient descent — the full equations
The Code	Implement backpropagation from scratch in NumPy
Why It Mattered	The end of the AI winter, deep learning, every modern AI product
Limitations	Vanishing gradients, local minima, computational cost
What Came Next	LSTMs mitigate the vanishing gradient problem; the road to modern deep learning
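
As a preview of the "How It Works" row, here is a minimal sketch of the four steps (forward pass, compute loss, backward pass, update weights) on the smallest network that has a hidden layer. It is an illustration under stated assumptions, not the paper's own code: the one-hidden-unit architecture, the squared-error loss, the learning rate, and all names and numbers (`w1`, `w2`, `train_step`, and so on) are chosen here for clarity, and it uses plain Python rather than NumPy to stay dependency-free.

```python
import math


def sigmoid(z):
    """Logistic activation, used for the hidden unit."""
    return 1.0 / (1.0 + math.exp(-z))


def forward(x, w1, w2):
    """Forward pass: input -> one sigmoid hidden unit -> linear output."""
    h = sigmoid(w1 * x)  # hidden activation
    y = w2 * h           # network output
    return h, y


def loss(y, target):
    """Squared-error loss (an assumption made for this sketch)."""
    return 0.5 * (y - target) ** 2


def backward(x, target, w1, w2):
    """Backward pass: apply the chain rule from the output back to w1."""
    h, y = forward(x, w1, w2)
    dL_dy = y - target               # dL/dy for squared error
    dL_dw2 = dL_dy * h               # dL/dw2 = dL/dy * dy/dw2
    dL_dh = dL_dy * w2               # error passed back to the hidden unit
    dL_dz = dL_dh * h * (1.0 - h)    # through the sigmoid: dh/dz = h * (1 - h)
    dL_dw1 = dL_dz * x               # dL/dw1 = dL/dz * dz/dw1
    return dL_dw1, dL_dw2


def train_step(x, target, w1, w2, lr=0.1):
    """One gradient-descent update of both weights."""
    g1, g2 = backward(x, target, w1, w2)
    return w1 - lr * g1, w2 - lr * g2


# Illustrative values only.
x, target = 1.5, 0.8
w1, w2 = 0.4, -0.3

loss_before = loss(forward(x, w1, w2)[1], target)
w1_new, w2_new = train_step(x, target, w1, w2)
loss_after = loss(forward(x, w1_new, w2_new)[1], target)

print(loss_before, loss_after)  # the loss should drop after one step
```

The point to notice is `dL_dh`: the hidden unit never sees a target of its own, yet it receives a share of the output error scaled by the weight that connects it to the output. That is the credit-assignment answer the table refers to.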

Paper at a glance

Difficulty: Intermediate — requires understanding of derivatives and the chain rule
Reading time: 55 minutes for all 9 sections
Math you need: Derivatives · Chain Rule · Partial Derivatives · Gradient Intuition
Key terms: Backpropagation · Gradient Descent · Loss Function · Hidden Layer

Start reading

Begin with Historical Context →

← Previous paper: The Perceptron (1958) · Next paper: LSTM (1997) →
