Section 09

What Came Next

First Learning Machine. The Perceptron: A Probabilistic Model for Information Storage and Organization in the Brain (1958)

The problem this paper left on the table

The Perceptron could learn. But it could only learn simple, linearly separable patterns.

XOR — and everything like it, every complex pattern that requires combining information in non-linear ways — was beyond it.
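The limit described above can be seen in a few lines of code. The following is a minimal sketch, not Rosenblatt's original formulation: it assumes a single threshold unit with two inputs, trained with the classic perceptron rule, and the names `train_perceptron` and `accuracy`, the learning rate, and the epoch count are all illustrative choices.

```python
def predict(weights, bias, x):
    """Threshold unit: output 1 if the weighted sum is positive, else 0."""
    total = sum(w * xi for w, xi in zip(weights, x)) + bias
    return 1 if total > 0 else 0


def train_perceptron(samples, epochs=100, lr=1.0):
    """Perceptron rule: shift weights by lr * (target - prediction) * input."""
    weights, bias = [0.0, 0.0], 0.0
    for _ in range(epochs):
        mistakes = 0
        for x, target in samples:
            error = target - predict(weights, bias, x)
            if error != 0:
                mistakes += 1
                weights = [w + lr * error * xi for w, xi in zip(weights, x)]
                bias += lr * error
        if mistakes == 0:  # a full pass with no errors: the data is separated
            break
    return weights, bias


def accuracy(samples, weights, bias):
    hits = sum(predict(weights, bias, x) == target for x, target in samples)
    return hits / len(samples)


AND = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]
XOR = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 0)]

and_acc = accuracy(AND, *train_perceptron(AND))
xor_acc = accuracy(XOR, *train_perceptron(XOR))
print("AND accuracy:", and_acc)  # reaches 1.0: AND is linearly separable
print("XOR accuracy:", xor_acc)  # stays below 1.0: no single line separates XOR
```

AND settles after a handful of passes. XOR never does, however many epochs are allowed, because no single straight line puts (0,1) and (1,0) on one side and (0,0) and (1,1) on the other.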

The obvious fix was: add more layers. If a single neuron can only draw one straight line, maybe a second layer can combine several lines. Maybe ten layers can draw arbitrarily complex boundaries.

In theory, this was correct. A two-layer network with a hidden layer can represent XOR. More generally, a network with a large enough hidden layer can approximate any continuous function on a bounded domain to any desired accuracy. This is the Universal Approximation Theorem, proved in 1989, some three decades after the Perceptron.
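To make the XOR claim concrete, here is a small sketch of a two-layer network built from the same threshold units. The weights are set by hand, not learned, and the specific values and the names `unit` and `xor_net` are assumptions chosen for illustration. One hidden unit acts as OR, the other as NAND, and the output unit fires only when both do.

```python
def step(total):
    return 1 if total > 0 else 0


def unit(weights, bias, inputs):
    """One threshold neuron: weighted sum plus bias, then a hard step."""
    return step(sum(w * x for w, x in zip(weights, inputs)) + bias)


def xor_net(a, b):
    h_or = unit([1, 1], -0.5, [a, b])     # fires if at least one input is on
    h_nand = unit([-1, -1], 1.5, [a, b])  # fires unless both inputs are on
    return unit([1, 1], -1.5, [h_or, h_nand])  # fires only if both hidden units fire


outputs = [xor_net(a, b) for a, b in [(0, 0), (0, 1), (1, 0), (1, 1)]]
print(outputs)  # [0, 1, 1, 0]
```

Each hidden unit draws one line and the output unit combines them. Nothing here says how a machine would find these weights on its own.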

The problem was entirely practical: how do you train a multi-layer network?

The Perceptron learning rule tells you how to update the weights in the output layer based on the error. But what about the weights in the hidden layer? You cannot directly observe whether those neurons are “wrong” — their outputs are internal to the network, not compared to any training label. The error is only measured at the final output.

This is called the credit assignment problem: when the network makes a mistake, which hidden neuron weights are responsible? How do you apportion blame across many layers?


The researchers who built on this work

Several researchers worked on this problem through the 1970s and early 1980s, often in isolation from each other.

Paul Werbos described a solution in his 1974 Harvard PhD thesis, but the thesis was not widely read and the idea was largely ignored.

In 1982, John Hopfield published influential work on a different kind of neural network — now called Hopfield Networks — that helped rekindle interest in the connectionist approach.

Then in 1986, three researchers published a paper that would change everything: David Rumelhart and Ronald Williams at the University of California San Diego, and Geoffrey Hinton at Carnegie Mellon University.

Their paper, “Learning Representations by Back-propagating Errors,” gave the field exactly what it needed: an efficient algorithm for training multi-layer networks by propagating error signals backwards through the network from output to input, using the chain rule of calculus to compute how each weight contributed to the final error.

This algorithm is called backpropagation, and it is the engine that trains nearly every neural network in use today.
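The chain-rule idea can be sketched on the smallest network that has a hidden layer. What follows is an illustrative sketch, not the formulation from the 1986 paper: it assumes sigmoid units, a squared-error loss, and arbitrary starting weights, and every name in it (`forward`, `backprop`, `params`) is hypothetical. It computes the gradient for one hidden weight by passing the output error backwards, then checks that value against a brute-force finite-difference estimate.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def forward(params, x):
    """2 inputs -> 2 sigmoid hidden units -> 1 sigmoid output."""
    w_hidden, b_hidden, w_out, b_out = params
    hidden = [sigmoid(sum(w * xi for w, xi in zip(row, x)) + b)
              for row, b in zip(w_hidden, b_hidden)]
    output = sigmoid(sum(w * h for w, h in zip(w_out, hidden)) + b_out)
    return hidden, output


def loss(params, x, target):
    """Squared error at the final output, the only place error is measured."""
    _, output = forward(params, x)
    return 0.5 * (output - target) ** 2


def backprop(params, x, target):
    """Gradients for all weights and biases, computed with the chain rule."""
    w_hidden, b_hidden, w_out, b_out = params
    hidden, output = forward(params, x)
    # Error signal at the output unit: dL/d(pre-activation of the output).
    delta_out = (output - target) * output * (1.0 - output)
    grad_w_out = [delta_out * h for h in hidden]
    # Send that signal back through the output weights: each hidden unit's
    # share of the blame is scaled by the weight that connects it forward.
    delta_hidden = [delta_out * w * h * (1.0 - h) for w, h in zip(w_out, hidden)]
    grad_w_hidden = [[d * xi for xi in x] for d in delta_hidden]
    # Bias gradients equal the deltas themselves.
    return grad_w_hidden, delta_hidden, grad_w_out, delta_out


# Arbitrary illustrative weights and a single training example.
params = ([[0.5, -0.3], [0.8, 0.2]], [0.1, -0.1], [0.7, -0.6], 0.05)
x, target = (1.0, 0.5), 1.0

analytic = backprop(params, x, target)[0][0][0]  # dL/d(w_hidden[0][0])


def loss_with_first_hidden_weight(value):
    w_hidden = [[value, -0.3], [0.8, 0.2]]
    return loss((w_hidden, params[1], params[2], params[3]), x, target)


eps = 1e-6
numeric = (loss_with_first_hidden_weight(0.5 + eps)
           - loss_with_first_hidden_weight(0.5 - eps)) / (2 * eps)

print("backprop gradient:         ", analytic)
print("finite-difference gradient:", numeric)
```

The two numbers agree to many decimal places. The finite-difference route needs two extra forward passes per weight, while the backward pass yields every gradient at once, which is where the efficiency comes from.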


What directions did others explore from here?

Not everyone was convinced by the neural network approach. The 1970s and 1980s were the era of expert systems: rule-based AI systems programmed by domain experts. These worked well for narrow, structured problems (diagnosing diseases from symptoms, configuring computer systems) and briefly looked like the future of AI.

They were not. Expert systems were brittle — they broke whenever they encountered inputs outside their programming. They could not learn. They could not generalise. By the late 1980s, it was clear that the knowledge engineering bottleneck — the need for human experts to manually codify all domain knowledge — was fatal.

The return of neural networks, with backpropagation, arrived just as expert systems were failing. The timing fits a recurring pattern: when rule-based AI hits a wall, the field turns back to learning-based AI.


The trail this paper left

The Perceptron gave us three things that every subsequent paper in this timeline depends on:

1. The neuron as a computational unit. Every layer in every modern neural network is built from neurons performing a weighted sum followed by a nonlinear function. The architecture has not fundamentally changed since 1958 — it has only deepened and widened.

2. The learning-from-examples paradigm. The idea that you provide data and let the machine find the pattern — rather than write the pattern in yourself — is now so dominant it is taken for granted. It was radical in 1958.

3. The weight update as the mechanism of learning. Adjusting weights in proportion to error is the core idea of gradient descent, and backpropagation is what makes gradient descent workable in networks with many layers. Together they are how nearly every neural network is trained. The Perceptron learning rule is an early, simpler relative of the same idea.
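The kinship in the third point can be shown side by side. This sketch assumes a single unit with two weights and compares the perceptron rule with one gradient-descent step on a squared-error loss for a linear unit (often called the delta rule). The function names, the example numbers, and the choice of squared error are illustrative assumptions, not details taken from the 1958 paper.

```python
def dot(w, x):
    return sum(wi * xi for wi, xi in zip(w, x))


def perceptron_update(w, x, y, lr):
    """Perceptron rule: the error uses the thresholded (0 or 1) prediction."""
    prediction = 1 if dot(w, x) > 0 else 0
    return [wi + lr * (y - prediction) * xi for wi, xi in zip(w, x)]


def gradient_descent_update(w, x, y, lr):
    """One gradient step on loss = 0.5 * (y - w.x)^2 for a linear unit."""
    output = dot(w, x)
    gradient = [-(y - output) * xi for xi in x]  # dloss/dw_i
    return [wi - lr * g for wi, g in zip(w, gradient)]


w, x, y, lr = [0.2, -0.4], [1.0, 2.0], 1, 0.1
perceptron_w = perceptron_update(w, x, y, lr)
gradient_w = gradient_descent_update(w, x, y, lr)
print("perceptron rule: ", perceptron_w)
print("gradient descent:", gradient_w)
```

Both updates have the shape "weight plus learning rate times error times input" and move the weights in the same direction here. The difference is that the perceptron measures error after the hard threshold, which has no useful derivative, while gradient descent needs a smooth output it can differentiate.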


Next paper: Backpropagation (1986) →

Rumelhart, Hinton and Williams solve the credit assignment problem — teaching multi-layer networks to learn, and ending the first AI winter.