Vanishing Gradient
The problem where gradients shrink toward zero as they propagate backwards through many layers (especially through sigmoid activations).
The problem where gradients shrink toward zero as they propagate backwards through many layers (especially through sigmoid activations). Early layers receive tiny gradients and barely learn. The key limitation of backpropagation in deep or recurrent networks. Solved by LSTMs (for sequences), ReLU activations, and batch normalisation.