Learning Rate

Appears in 3 papers

A small positive number (e.g.

As used in Paper 02 — The Perceptron: A Probabilistic Model for Information Storage and Organization in the Brain →

A small positive number (e.g. 0.1) that controls how big each weight update step is. Too high: learning overshoots and oscillates. Too low: learning is painfully slow. Choosing the right learning rate remains an important practical consideration in all neural network training today.

As used in Paper 03 — Learning Representations by Back-propagating Errors →

The η (eta) in the weight update rule w ← w - η×∂L/∂w. Controls the step size in gradient descent. Too high: overshoots, diverges. Too low: learns very slowly. Finding the right learning rate is one of the most important practical tasks in training neural networks.

As used in Paper 13 — Scaling Laws for Neural Language Models →

A hyperparameter controlling the step size in gradient descent. The scaling laws assume learning rate is optimized for each model size. Poor learning rate scheduling breaks the scaling laws.