Vanishing Gradient

Appears in 1 paper

The problem where gradients shrink toward zero as they propagate backwards through many layers (especially through sigmoid activations).

As used in Paper 03 — Learning Representations by Back-propagating Errors →

The problem where gradients shrink toward zero as they propagate backwards through many layers (especially through sigmoid activations). Early layers receive tiny gradients and barely learn. The key limitation of backpropagation in deep or recurrent networks. Solved by LSTMs (for sequences), ReLU activations, and batch normalisation.