Section 07

Why It Mattered

How Neural Networks Learn: Learning Representations by Back-propagating Errors (1986)

The end of the first AI winter

The 1986 paper did not immediately trigger a revolution. The deep learning explosion came some two decades later, in the late 2000s and 2010s. But the paper ended the intellectual drought: it gave researchers a working algorithm for training multi-layer networks, and demonstrated that such networks could learn complex, meaningful internal representations.

In the years that followed:

1989: Yann LeCun’s handwriting recognition
Yann LeCun (then at Bell Labs, later chief AI scientist at Meta) applied backpropagation to a convolutional network trained on handwritten digits. The network, an early member of the family later named LeNet, could recognise ZIP codes written on envelopes. By the late 1990s, descendants of this system were reading an estimated 10–20% of all cheques written in the United States. This was one of the first major commercial successes of backpropagation.

1989: Universal Approximation Theorem
George Cybenko proved that a neural network with a single hidden layer of sigmoidal units, given enough of them, can approximate any continuous function on a closed, bounded region of input space to arbitrary accuracy. This gave theoretical backing to what practitioners were already seeing: neural networks are extraordinarily expressive. The theorem guarantees that suitable weights exist; it does not say how to find them, which is the job backpropagation does in practice.
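The idea behind the theorem can be made concrete with a small sketch. The code below is not Cybenko's proof and involves no training: it hand-picks the weights of a one-hidden-layer sigmoid network so that each hidden unit contributes one small "step", and the steps add up to a staircase that tracks the target function. The target (sin on [0, 2π]), the function names, and the steepness value are all illustrative assumptions.

```python
import math

def sigmoid(z):
    # Numerically stable logistic function (avoids overflow for large |z|).
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def build_network(f, lo, hi, n_hidden, steepness=20.0):
    """Hand-pick weights for a one-hidden-layer sigmoid network on [lo, hi].

    Hidden unit i switches from 0 to 1 around the midpoint of the i-th
    sub-interval; its output weight is the change in f over that sub-interval.
    """
    h = (hi - lo) / n_hidden
    knots = [lo + i * h for i in range(n_hidden + 1)]
    k = steepness / h  # steeper sigmoids for narrower sub-intervals
    hidden = [(k, -k * (knots[i] + knots[i + 1]) / 2) for i in range(n_hidden)]
    out_w = [f(knots[i + 1]) - f(knots[i]) for i in range(n_hidden)]
    out_b = f(lo)

    def net(x):
        return out_b + sum(w * sigmoid(a * x + b)
                           for (a, b), w in zip(hidden, out_w))
    return net

def max_error(f, net, lo, hi, samples=1000):
    # Largest absolute gap between f and the network on an evenly spaced grid.
    xs = (lo + (hi - lo) * i / samples for i in range(samples + 1))
    return max(abs(f(x) - net(x)) for x in xs)

if __name__ == "__main__":
    for n in (10, 50, 200):
        net = build_network(math.sin, 0.0, 2 * math.pi, n)
        print(n, "hidden units, max error:",
              max_error(math.sin, net, 0.0, 2 * math.pi))
```

Running it should show the maximum error shrinking as hidden units are added, which is the qualitative content of the theorem: more units, better approximation. In practice nobody sets weights by hand like this; backpropagation finds them from data.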

1997: LSTMs
The vanishing gradient problem (discussed in Limitations) made it hard for backpropagation to train very deep networks, or recurrent networks unrolled over long sequences. Hochreiter and Schmidhuber addressed this for sequences by designing the LSTM (long short-term memory), a specialised recurrent architecture that lets gradients flow across many time steps without vanishing. LSTMs became the dominant architecture for sequence tasks in NLP during the 2010s, until transformers arrived in 2017.
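To see why long sequences were a problem, here is a toy calculation, not an LSTM and not taken from the paper. It assumes a one-unit recurrent network h_t = tanh(w · h_{t-1}) with a hypothetical weight w = 0.9 and starting state 0.5, and multiplies together the local derivatives that backpropagation would chain across time steps.

```python
import math

def gradient_through_time(steps, w=0.9, h0=0.5):
    """Return d h_T / d h_0 for the toy recurrence h_t = tanh(w * h_{t-1})."""
    h, grad = h0, 1.0
    for _ in range(steps):
        h = math.tanh(w * h)
        # Chain rule: derivative of tanh(w * h_prev) w.r.t. h_prev
        # is w * (1 - tanh(w * h_prev)^2) = w * (1 - h^2).
        grad *= w * (1.0 - h * h)
    return grad

if __name__ == "__main__":
    for steps in (1, 10, 50, 100):
        print(steps, "steps:", gradient_through_time(steps))
```

Each factor here is at most 0.9, so after 100 steps the gradient is no larger than 0.9^100, which is roughly 0.00003. A signal that small gives the early time steps almost no influence on learning. The LSTM's gated memory cell was designed to keep this product from collapsing.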


The deep learning explosion (2012 onwards)

The full power of backpropagation was not realised until three things came together:

  1. Big data: The internet produced datasets of millions of labelled images, texts, and other examples, enough to train very deep networks without severe overfitting
  2. GPUs: Graphics processing units turned out to be well suited to the matrix operations in neural networks, with speedups over CPUs often quoted in the 10–100× range
  3. Algorithmic improvements: Better activation functions (ReLU replacing sigmoid), better weight initialisation, dropout regularisation, batch normalisation

When these three combined, starting around 2009–2012, the results were startling.

ImageNet 2012: AlexNet, a deep convolutional network trained with backpropagation on GPUs, achieved a top-5 error rate of about 15% in the ImageNet image classification challenge, against about 26% for the runner-up. The margin was so large that it persuaded much of the computer vision community to abandon hand-engineered methods remarkably quickly.

From that point, nearly every major AI advance has been built on backpropagation:

  • GPT-1, 2, 3, 4, Claude, Gemini — language models trained with backprop
  • AlphaFold — protein structure predictor trained with backprop
  • AlphaGo — game-playing AI trained with backprop + reinforcement learning
  • DALL-E, Stable Diffusion — image generators trained with backprop
  • Every self-driving car perception system — trained with backprop

Why Geoffrey Hinton won the Nobel Prize

In 2024, Geoffrey Hinton (co-author of this paper) shared the Nobel Prize in Physics with John Hopfield “for foundational discoveries and inventions that enable machine learning with artificial neural networks.” The committee’s announcement highlighted Hinton’s Boltzmann machine in particular; its background material also discusses backpropagation as part of the same body of work.

This was remarkable: a Nobel Prize in Physics for work in machine learning. The committee’s reasoning, in outline: the laureates drew on ideas from physics to build these methods, and neural networks have since become a standard tool across the sciences. Hinton’s work, including backpropagation, helped make this possible.

It was also bittersweet. Hinton had left Google in 2023, partly to speak freely about the risks of powerful AI systems — systems whose power traces directly back to his own work.


Why a student in small-town India should care

Nearly every AI tool you will use in your career (translation services, code assistants, image generators, voice recognition systems) was trained using backpropagation.

Understanding backpropagation is not just interesting history. It is practical knowledge. When you use a machine learning library and call loss.backward(), you are invoking this algorithm. When you read a research paper and see “trained with stochastic gradient descent,” you are reading about an optimiser that takes its gradients from this algorithm. When you debug a neural network that is not learning, knowing backpropagation tells you where to look: are the gradients vanishing, exploding, or failing to reach some layers at all?
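To make the loss.backward() call less mysterious, here is a minimal sketch in plain Python of the work such a call does (in libraries like PyTorch), for a deliberately tiny network: one input, one sigmoid hidden unit, one linear output, and a squared-error loss. The network shape, starting weights, and function names are all assumptions chosen for illustration; real libraries do the same chain-rule bookkeeping automatically for millions of parameters.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def forward(params, x, target):
    w1, b1, w2, b2 = params
    h = sigmoid(w1 * x + b1)         # hidden activation
    y = w2 * h + b2                  # network output
    loss = 0.5 * (y - target) ** 2   # squared error
    return h, y, loss

def backward(params, x, target):
    """Apply the chain rule from the loss back to each parameter."""
    w1, b1, w2, b2 = params
    h, y, _ = forward(params, x, target)
    dy = y - target                  # dL/dy
    dw2, db2 = dy * h, dy            # output-layer gradients
    dh = dy * w2                     # error propagated back to the hidden unit
    dz = dh * h * (1.0 - h)          # through the sigmoid's derivative
    dw1, db1 = dz * x, dz            # hidden-layer gradients
    return [dw1, db1, dw2, db2]

def numerical_gradient(params, x, target, eps=1e-6):
    # Independent check: nudge each parameter and measure the change in loss.
    grads = []
    for i in range(len(params)):
        up, down = list(params), list(params)
        up[i] += eps
        down[i] -= eps
        grads.append((forward(up, x, target)[2]
                      - forward(down, x, target)[2]) / (2 * eps))
    return grads

def train(params, x, target, lr=0.5, steps=200):
    # Plain gradient descent on a single example.
    params = list(params)
    for _ in range(steps):
        grads = backward(params, x, target)
        params = [p - lr * g for p, g in zip(params, grads)]
    return params

if __name__ == "__main__":
    start = [0.5, -0.2, 0.8, 0.1]    # hypothetical starting weights
    print("backprop :", backward(start, 1.0, 1.0))
    print("numerical:", numerical_gradient(start, 1.0, 1.0))
    trained = train(start, 1.0, 1.0)
    print("loss before:", forward(start, 1.0, 1.0)[2])
    print("loss after :", forward(trained, 1.0, 1.0)[2])
```

The two gradient lines should agree to several decimal places. Comparing backpropagated gradients against numerical ones is a common debugging check when a network is not learning.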

This is the algorithm that built the modern world of AI.

