Limitations and Criticism

1. The vanishing gradient problem

This is the most important limitation of backpropagation — and it nearly killed deep learning again before it had properly started.

Recall how backpropagation works: it propagates the error signal backwards through the network, layer by layer. At each layer, the gradient is multiplied by that layer's weights and by the derivative of the activation function.

The sigmoid’s derivative σ’(z) = σ(z)(1-σ(z)) always lies between 0 and 0.25. It reaches its maximum of 0.25 at z = 0 and falls towards 0 as z moves away from 0 in either direction.
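The bound on the derivative can be checked numerically. The following is a minimal sketch using only the standard library; the function names and the grid of test points (from -10 to 10 in steps of 0.01) are illustrative choices, not part of the original derivation.

```python
import math


def sigmoid(z: float) -> float:
    """Logistic sigmoid: 1 / (1 + e^(-z))."""
    return 1.0 / (1.0 + math.exp(-z))


def sigmoid_prime(z: float) -> float:
    """Derivative of the sigmoid: sigma(z) * (1 - sigma(z))."""
    s = sigmoid(z)
    return s * (1.0 - s)


# Arbitrary grid of inputs: -10.00, -9.99, ..., 9.99, 10.00
grid = [i / 100.0 for i in range(-1000, 1001)]

# The grid point where the derivative is largest
peak_z = max(grid, key=sigmoid_prime)

print(peak_z, sigmoid_prime(peak_z))  # 0.0 0.25
print(sigmoid_prime(5.0))             # much smaller once z is far from 0
```

On this grid the derivative never exceeds 0.25, and it is already below 0.01 at z = 5.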

So at each sigmoid layer, the activation derivative alone scales the gradient by a factor of at most 0.25.

In a 10-layer network:

Gradient at layer 1 ≤ (original gradient) × 0.25¹⁰ ≈ (original gradient) × 0.000001

The gradient shrinks by a factor of about a million or more. By the time it reaches the early layers, it is effectively zero. Early layers receive almost no learning signal, so their weights barely change. The network cannot learn deep representations.
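The shrinkage can be reproduced in a few lines. This is a simplified sketch, not a real network: it assumes one unit per layer and sets every weight to 1, so only the sigmoid derivatives act on the gradient. The random pre-activations (standard normal, fixed seed) are a hypothetical stand-in for what a network might produce.

```python
import math
import random


def sigmoid_prime(z: float) -> float:
    """Derivative of the sigmoid: sigma(z) * (1 - sigma(z))."""
    s = 1.0 / (1.0 + math.exp(-z))
    return s * (1.0 - s)


NUM_LAYERS = 10

# Best case: every unit sits exactly at z = 0, where the derivative is 0.25.
upper_bound = 0.25 ** NUM_LAYERS
print(f"{upper_bound:.3e}")  # 9.537e-07

# More typical case: pre-activations scattered around 0 (assumed distribution).
random.seed(0)
factor = 1.0
for _ in range(NUM_LAYERS):
    z = random.gauss(0.0, 1.0)   # hypothetical pre-activation for this layer
    factor *= sigmoid_prime(z)   # weights are taken to be 1 in this sketch

print(f"{factor:.3e}")  # never larger than the best-case bound
```

Even the best case leaves less than one millionth of the original gradient after ten layers, and any pre-activation away from zero makes it smaller still.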

This is the vanishing gradient problem, first analysed by Hochreiter in his 1991 diploma thesis (supervised by Schmidhuber) and examined in depth in a 1994 paper by Bengio, Simard and Frasconi.

It limited backpropagation to networks of at most a few layers throughout the late 1980s and 1990s.

How it was solved:

LSTM (1997): Designed a path through which gradients can flow without multiplying through many sigmoid derivatives. Covered in Paper 04.
ReLU activation (2010s): The Rectified Linear Unit f(z) = max(0,z) has derivative 1 for z>0, so gradients do not shrink as they pass through active units. ReLU replaced sigmoid as the standard hidden-layer activation function and helped make very deep networks trainable.
Batch Normalisation (2015): Normalises activations at each layer, keeping them in a range where derivatives such as the sigmoid's stay reasonably large.
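The difference between sigmoid and ReLU in the list above can be sketched numerically. As before, this assumes one unit per layer and weights of 1. The ten pre-activation values are made up for illustration and are all positive, so every ReLU is active. For z ≤ 0 the ReLU derivative is 0 and that path passes no gradient at all.

```python
import math


def sigmoid_prime(z: float) -> float:
    """Derivative of the sigmoid: sigma(z) * (1 - sigma(z))."""
    s = 1.0 / (1.0 + math.exp(-z))
    return s * (1.0 - s)


def relu_prime(z: float) -> float:
    """Derivative of max(0, z): 1 for z > 0, otherwise 0."""
    return 1.0 if z > 0 else 0.0


# Hypothetical pre-activations for a 10-layer chain, all positive.
pre_activations = [0.5, 1.2, 0.3, 2.0, 0.8, 1.5, 0.1, 0.9, 1.1, 0.4]

sigmoid_factor = 1.0
relu_factor = 1.0
for z in pre_activations:
    sigmoid_factor *= sigmoid_prime(z)
    relu_factor *= relu_prime(z)

print(f"sigmoid chain: {sigmoid_factor:.3e}")  # far below 1
print(f"ReLU chain:    {relu_factor}")         # 1.0, gradient passes unchanged
```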

2. Local minima and saddle points

Gradient descent follows the gradient downhill — but there is no guarantee it reaches the global minimum. It might get stuck in a local minimum: a valley that is not the lowest point in the landscape.
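A one-dimensional toy example makes this concrete. The function f(x) = x⁴ - 3x² + x is a hypothetical loss chosen only because it has two valleys of different depth. The learning rate, step count and starting points are likewise arbitrary.

```python
def f(x: float) -> float:
    """Toy loss with two valleys: a shallow one on the right, a deeper one on the left."""
    return x**4 - 3 * x**2 + x


def grad(x: float) -> float:
    """Derivative of f."""
    return 4 * x**3 - 6 * x + 1


def gradient_descent(x: float, lr: float = 0.01, steps: int = 1000) -> float:
    """Plain gradient descent: repeatedly step against the gradient."""
    for _ in range(steps):
        x -= lr * grad(x)
    return x


x_right = gradient_descent(2.0)    # starts on the right-hand slope
x_left = gradient_descent(-2.0)    # starts on the left-hand slope

print(f"from  2.0: x = {x_right:.3f}, f(x) = {f(x_right):.3f}")
print(f"from -2.0: x = {x_left:.3f}, f(x) = {f(x_left):.3f}")
```

Both runs end at a point where the gradient is zero, but the run that starts at x = 2 settles in the shallower valley and never sees the deeper one on the other side.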

For many years, this was considered a major practical problem. If the network gets stuck in a bad local minimum, it will never find the best possible weights.

Modern research has partially resolved this concern. In very high-dimensional spaces (millions of weights), true local minima appear to be relatively rare. Most apparent traps are saddle points: points where the gradient is zero, but the surface still curves downhill in some directions. Optimisers that use momentum and adaptive step sizes, such as Adam, tend to move through these regions faster than plain gradient descent.
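The simplest saddle is f(x, y) = x² - y², used here as a toy surface rather than a real loss. The gradient is zero at the origin, yet the surface curves up along x and down along y. The sketch below runs plain gradient descent (not Adam) with an arbitrary learning rate and step count. Note that this surface is unbounded below, so only the escape itself is meaningful, not the final value.

```python
def f(x: float, y: float) -> float:
    """Toy saddle surface: curves up along x, down along y."""
    return x * x - y * y


def grad(x: float, y: float) -> tuple[float, float]:
    """Gradient of f: (2x, -2y). It is zero at the origin."""
    return 2.0 * x, -2.0 * y


def descend(x: float, y: float, lr: float = 0.1, steps: int = 60) -> tuple[float, float]:
    """Plain gradient descent from (x, y)."""
    for _ in range(steps):
        gx, gy = grad(x, y)
        x, y = x - lr * gx, y - lr * gy
    return x, y


# Exactly on the x-axis: the y-gradient is always zero, so descent
# slides into the saddle point and stays there.
stuck = descend(1.0, 0.0)

# A tiny offset in y is enough: the downhill direction amplifies it
# and the iterate leaves the saddle.
escaped = descend(1.0, 1e-3)

print("on the axis:  ", stuck, f(*stuck))
print("with a nudge: ", escaped, f(*escaped))
```

The perfectly aligned start is a measure-zero special case. Any small perturbation, such as the noise in stochastic gradient descent, exposes the downhill direction.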

More importantly: empirically, the local minima that large networks find tend to be “good enough” — the performance loss compared to the true global minimum is small.

3. It is slow without GPUs

Backpropagation requires matrix multiplications across all layers, for every training example, repeated millions or billions of times. On a 1986 CPU, training even a simple network on a modest dataset took days.
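A back-of-envelope count shows how quickly the arithmetic adds up. Everything in this sketch is an assumption made for illustration: the layer sizes, the dataset size and the number of epochs are hypothetical, biases and activation functions are ignored, and the backward pass is taken to cost roughly twice the forward pass (one matrix product for the weight gradients and one to pass the error to the previous layer).

```python
# Hypothetical fully connected network: 784 inputs, two hidden layers, 10 outputs.
layer_sizes = [784, 128, 64, 10]

# One multiply-add per weight for a forward pass on a single example.
forward_macs = sum(
    n_in * n_out for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:])
)

# Rule-of-thumb assumption: backward pass costs about 2x the forward pass.
macs_per_example = 3 * forward_macs

num_examples = 60_000   # hypothetical dataset size
num_epochs = 10         # hypothetical number of passes over the data

total_macs = macs_per_example * num_examples * num_epochs

print(f"forward multiply-adds per example: {forward_macs:,}")  # 109,184
print(f"total multiply-adds for training:  {total_macs:.2e}")
```

Even this small network needs on the order of 10¹¹ multiply-adds, which is why hardware speed set the practical limit.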

This limited the practical scope of what could be done with backprop throughout the 1980s and 1990s. The algorithm was correct — but the hardware was too slow to realise its potential.

The GPU revolution of the 2000s and 2010s solved this. GPUs are designed for the exact kind of parallel matrix operations that backpropagation requires. Training time dropped from days to hours to minutes.

4. It was not actually new

When Rumelhart, Hinton and Williams published in 1986, backpropagation had actually been independently discovered at least three times before:

Paul Werbos described it in his 1974 PhD thesis at Harvard, but the work reached few readers and was largely ignored.
David Parker independently derived it in 1985 at MIT and distributed a technical report.
Yann LeCun independently derived it in 1985 in France.

The 1986 Nature paper by Rumelhart et al. is credited with launching backpropagation into mainstream awareness — not because of priority, but because of clarity of exposition, the right venue, and the right moment in history.

Science is often like this: an idea whose time has come gets discovered multiple times. Credit goes to whoever communicates it most effectively to the right audience.

5. It needs labelled data

Backpropagation requires training examples with correct labels — “this image is a cat,” “this sentence has positive sentiment.” Collecting and labelling data is expensive and slow.

For much of the history of deep learning, this was the main bottleneck. Labelling enough images for ImageNet took years of crowd-sourced human effort.

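Modern approaches such as self-supervised learning and large-scale pretraining are partly attempts to reduce dependence on human-labelled data. RLHF still relies on human feedback, but as preference judgements over model outputs rather than a label for every training example. These are covered in later papers.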
