Section 03

The Idea: Power Laws

Scaling Laws for Neural Language Models 2020

The core insight is surprisingly simple: Loss follows a power law with respect to model size, data size, and compute.

What is a Power Law?

A power law is a relationship of the form:

y = a * x^b

where:
  y = output (loss)
  x = input (model size, or data size)
  a = constant (determines the height of the curve)
  b = exponent (determines the slope)

Example:

L(N) = 0.5 * N^(-0.076)

If N = 1B: L = 0.5 * (10^9)^(-0.076) ≈ 0.104
If N = 10B: L = 0.5 * (10^10)^(-0.076) ≈ 0.087
If N = 100B: L = 0.5 * (10^11)^(-0.076) ≈ 0.073

Notice: As N increases, L decreases. The bigger the model, the lower the loss (better).
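
The example formula can be checked numerically. The snippet below is a minimal sketch that evaluates it with the illustrative constants a = 0.5 and b = -0.076 from the example; the function name power_law_loss is made up for this illustration, and the constants are not the paper's fitted values.

```python
def power_law_loss(n_params: float, a: float = 0.5, b: float = -0.076) -> float:
    """Evaluate L(N) = a * N^b with the illustrative constants from the example."""
    return a * n_params ** b


# Evaluate the example at 1B, 10B, and 100B parameters.
for n in (1e9, 1e10, 1e11):
    print(f"N = {n:.0e}: L = {power_law_loss(n):.3f}")
```

Each 10x increase in N multiplies the loss by the same factor, 10^(-0.076) ≈ 0.84, which is the defining behavior of a power law.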

Why Power Laws are Special

Power laws have a special property: they’re straight lines on log-log axes.

Normal axes:          Log-log axes:
L                     log(L)
^                     ^
| \                   | \
|  \                  |   \
|   `-.__             |     \
|        `---         |       \
+---------> N         +---------> log(N)

Linear on log-log = power law on regular axes
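
To see why, take the logarithm of both sides of the power law, using the same symbols as the definition above:

```latex
y = a\,x^{b}
\quad\Longrightarrow\quad
\log y = \log a + b \log x
```

In the variables log x and log y this is the equation of a straight line with slope b and intercept log a.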

This is useful because:

  1. Easy to fit: Plot your data on log-log axes, fit a line, extract the exponent.
  2. Easy to extrapolate: Extend the line to predict performance at larger scales.
  3. Recurring pattern: Different domains (vision, speech, NLP) often show power laws.
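
The "easy to fit" and "easy to extrapolate" points can be sketched in a few lines. The example below is only an illustration: the data points are synthetic, generated from L = 0.5 * N^(-0.076) rather than taken from real training runs, and fit_power_law is a hypothetical helper that performs an ordinary least-squares line fit in log-log space.

```python
import math


def fit_power_law(xs, ys):
    """Least-squares fit of y = a * x^b via a straight line in log-log space."""
    log_x = [math.log(x) for x in xs]
    log_y = [math.log(y) for y in ys]
    n = len(xs)
    mean_x = sum(log_x) / n
    mean_y = sum(log_y) / n
    # The slope of the fitted line is the exponent b.
    covariance = sum((lx - mean_x) * (ly - mean_y) for lx, ly in zip(log_x, log_y))
    variance = sum((lx - mean_x) ** 2 for lx in log_x)
    b = covariance / variance
    # The intercept of the fitted line is log(a).
    a = math.exp(mean_y - b * mean_x)
    return a, b


# Synthetic "measurements" (an assumption for illustration, not real results).
sizes = [1e6, 1e7, 1e8, 1e9]
losses = [0.5 * n ** -0.076 for n in sizes]

a_fit, b_fit = fit_power_law(sizes, losses)
# Extrapolate one decade beyond the largest fitted point.
predicted_loss = a_fit * 1e10 ** b_fit
print(f"a = {a_fit:.3f}, b = {b_fit:.3f}, predicted L(1e10) = {predicted_loss:.3f}")
```

With noisy real measurements the recovered exponent would only be approximate, and extrapolation is trustworthy only as far as the straight-line trend actually continues.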

The Paper’s Finding

The authors trained Transformer language models across a wide range of scales:

Model sizes (N):      768 to 1.5B non-embedding parameters
Dataset sizes (D):    22M to 23B tokens

For each (N, D) pair, they measured cross-entropy loss. The results:

Loss vs. Model Size:
L(N) ∝ N^(-α_N)  where α_N ≈ 0.076

Loss vs. Dataset Size:
L(D) ∝ D^(-α_D)  where α_D ≈ 0.103 (from the paper's joint fit in N and D; the data-only fit gives ≈ 0.095)

Loss vs. Compute:
L(C) ∝ C^(-α_C)  where α_C ≈ 0.050 (when compute is allocated optimally)

The key discovery: These relationships are smooth and consistent, tracing straight lines on log-log axes, with some trends spanning more than seven orders of magnitude.

The Intuition

Why might this be true?

Pre-training as compression: Language models compress the data. A larger model can store more patterns. Adding parameters lets it compress more, reducing loss.

But with diminishing returns: Each additional parameter is less useful than the previous one. The curve is decreasing but slowing down (hence the power law, not linear).

Data similarly: More data means more patterns to learn. A model can fit and generalize from more data. But again, each additional token is less informative than the previous one.

The power-law exponents (0.076 for N, 0.103 for D) quantify this diminishing return.
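
To make the diminishing returns concrete, the sketch below computes how much the loss falls when model size or dataset size is multiplied by a given factor. It assumes a pure power law with the exponents quoted above and nothing else limiting performance; loss_ratio is a name introduced here for illustration.

```python
def loss_ratio(scale_factor: float, exponent: float) -> float:
    """Factor by which loss changes when the input grows by scale_factor,
    assuming loss is proportional to input^(-exponent)."""
    return scale_factor ** -exponent


for name, alpha in (("model size N", 0.076), ("dataset size D", 0.103)):
    for factor in (2, 10):
        drop = 1 - loss_ratio(factor, alpha)
        print(f"{factor}x {name}: loss falls by about {drop:.1%}")
```

Under these assumptions, doubling the model size cuts the loss by roughly 5%, and doubling the data cuts it by roughly 7%. Every further doubling buys the same percentage reduction at twice the cost.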

Why This Was Surprising

Before this paper, a common expectation was that loss would level off, hitting a plateau where adding more parameters gives almost no benefit.

Instead, the paper showed: Loss keeps decreasing smoothly. No plateau was visible within the range studied (up to about 1.5B parameters).

This suggested: “We haven’t found the plateau yet. We should keep scaling.”

This finding helped motivate GPT-3’s 175B parameters. If the laws hold, a 175B model should reach substantially lower loss than the largest models studied in the paper.

The Compute-Optimal Frontier

If you have a fixed compute budget C, there’s an optimal way to split it between N and D.

Too many parameters, too little data:

  • The model is too large relative to its training.
  • Overfit risk.
  • Loss doesn’t decrease as fast.

Too few parameters, too much data:

  • The model is too small to capture the data.
  • Underfit.
  • Loss plateaus early.

Optimal balance: The paper found empirically that the optimal allocation is roughly:

N_opt ∝ C^0.73
D_opt ∝ C^0.27

Or equivalently:
N_opt ∝ D_opt^2.7

(The exponent 2.7 is the ratio 0.73 / 0.27. It means that along the optimal frontier, a 10x increase in data goes with roughly a 10^2.7 ≈ 500x increase in model size.)

What this means: If you double your compute budget, you should:

  • Increase model size by a factor of 2^0.73 ≈ 1.66x (66% bigger)
  • Increase data size by a factor of 2^0.27 ≈ 1.21x (21% more)
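
The same arithmetic works for any increase in compute. The sketch below assumes the exponents quoted above (N_opt ∝ C^0.73, D_opt ∝ C^0.27); optimal_scaling is a hypothetical helper, not something defined in the paper.

```python
def optimal_scaling(compute_multiplier: float,
                    n_exponent: float = 0.73,
                    d_exponent: float = 0.27):
    """Return (model_factor, data_factor) for a given increase in compute,
    assuming N_opt ∝ C^n_exponent and D_opt ∝ C^d_exponent."""
    model_factor = compute_multiplier ** n_exponent
    data_factor = compute_multiplier ** d_exponent
    return model_factor, data_factor


for k in (2, 10, 100):
    n_factor, d_factor = optimal_scaling(k)
    print(f"{k}x compute -> {n_factor:.2f}x parameters, {d_factor:.2f}x tokens")
```

Because the two exponents sum to 1, the model factor times the data factor always equals the compute multiplier. The exponents are growth rates, not percentages of the budget.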

The Indian Analogy

Imagine a student studying for an exam:

  • Study time (compute): Fixed at 20 hours.
  • Number of subjects (model size): How many subjects to prepare?
  • Number of problems (data): How many practice problems per subject?

If the student prepares all 20 subjects with no practice: knows the topics, but can’t solve problems. Loss is high.

If the student does 1,000 practice problems on just 2 subjects: becomes expert in 2 subjects, fails on 18 others. Loss is high.

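Optimal strategy: Cover most of the subjects and do a moderate number of practice problems on each, balancing breadth and depth. If the available study time grows, the scaling laws say to add breadth (model size, growing as C^0.73) faster than depth (data, growing as C^0.27). Note that 0.73 and 0.27 are growth exponents, not percentages of the study time.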

The scaling laws are the study strategy that maximizes exam performance for a given study time.


Key Takeaways from This Section

  • Power law: Loss ∝ N^(-0.076), Loss ∝ D^(-0.103)
  • Straight line on log-log axes: Plot loss vs. size on log-log axes and you get a straight line.
  • No plateau: Loss keeps decreasing across the range studied; the curve is smooth.
  • Optimal scaling: As the compute budget C grows, scale model size as C^0.73 and data size as C^0.27, so most of the extra compute goes into a bigger model.
  • Surprising finding: Some trends hold across more than seven orders of magnitude, enabling extrapolation.

Next: Section 04: The Math