Section 04

The Math: Scaling Law Equations

Scaling Laws for Neural Language Models 2020

Prerequisites: Mean, Variance, Standard Deviation, Cross-Entropy Loss

The Three Power Laws

Law 1: Loss vs. Model Size

L(N) = a * N^(-α)

where:
  L(N) = cross-entropy loss as a function of parameter count N
  N = number of parameters
  a = coefficient (depends on data and compute efficiency)
  α ≈ 0.076 (empirically measured)

This means: as the parameter count grows, loss falls, and the exponent α sets the rate. With α ≈ 0.076, doubling N multiplies the loss by 2^(-0.076) ≈ 0.949, a reduction of about 5%.
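
To make the size of this effect concrete, here is a minimal Python sketch of Law 1. The function name `loss_from_params` and the default coefficient a = 0.6 are illustrative assumptions, not values from the paper; only the exponent 0.076 comes from the text above.

```python
ALPHA = 0.076  # exponent for model size, as quoted above


def loss_from_params(n_params: float, a: float = 0.6, alpha: float = ALPHA) -> float:
    """Power law L(N) = a * N^(-alpha). The default a is an illustrative assumption."""
    return a * n_params ** (-alpha)


# Doubling N multiplies the loss by 2^(-alpha), independent of a and of N itself.
doubling_factor = 2 ** (-ALPHA)
print(f"loss multiplier per doubling of N: {doubling_factor:.4f}")

# The same ratio falls out of evaluating the law at N and 2N.
ratio = loss_from_params(2e9) / loss_from_params(1e9)
print(f"L(2e9) / L(1e9) = {ratio:.4f}")
```

Because the coefficient cancels in the ratio, the per-doubling improvement depends only on the exponent.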

Law 2: Loss vs. Data Size

L(D) = b * D^(-β)

where:
  L(D) = cross-entropy loss as a function of data tokens D
  D = number of training tokens
  b = coefficient
  β ≈ 0.103 (empirically measured)

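Similarly: as the number of training tokens grows, loss falls at a rate set by β. With β ≈ 0.103, doubling D multiplies the loss by 2^(-0.103) ≈ 0.931, a reduction of about 7%.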
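
The data law can also be read in reverse: how much more data does a given loss reduction cost? The sketch below inverts Law 2 for that question. The helper name `data_multiplier_for_loss_ratio` is hypothetical, and the calculation assumes data is the only limiting factor.

```python
BETA = 0.103  # exponent for data size, as quoted above


def data_multiplier_for_loss_ratio(loss_ratio: float, beta: float = BETA) -> float:
    """Factor by which D must grow so that L(D) is multiplied by loss_ratio.

    From L(D) = b * D^(-beta): L2 / L1 = (D2 / D1)^(-beta),
    so D2 / D1 = loss_ratio^(-1 / beta). The coefficient b cancels.
    """
    if not 0 < loss_ratio <= 1:
        raise ValueError("loss_ratio must be in (0, 1]")
    return loss_ratio ** (-1.0 / beta)


# Data needed for a 10% lower loss under Law 2 alone.
multiplier = data_multiplier_for_loss_ratio(0.9)
print(f"tokens must grow by about {multiplier:.2f}x")
```

A small exponent means large multipliers: under this law alone, a 10% lower loss needs nearly three times the tokens.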

Law 3: Loss vs. Compute Budget

L(C) = c * C^(-γ)

where:
  L(C) = loss as a function of compute budget C (in FLOPs)
  C = compute (FLOPs = floating point operations)
  c = coefficient
  γ ≈ 0.050 (the paper's fitted value when compute is allocated optimally)

Relationship: C ≈ 6 * N * D
(The factor 6 is an approximation, not a fitted constant: roughly 2 FLOPs per parameter per token for the forward pass and 4 for the backward pass.)
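
As a quick illustration of the compute relationship, the following sketch converts between N, D, and C. The function names are hypothetical helpers introduced here, and the formula is the approximation C ≈ 6 * N * D stated above.

```python
def training_flops(n_params: float, n_tokens: float) -> float:
    """Approximate training compute C = 6 * N * D, in FLOPs."""
    return 6.0 * n_params * n_tokens


def tokens_for_budget(compute_flops: float, n_params: float) -> float:
    """Invert C = 6 * N * D to get the token count a budget allows."""
    return compute_flops / (6.0 * n_params)


c = training_flops(10e9, 100e9)  # N = 10B parameters, D = 100B tokens
print(f"C = {c:.2e} FLOPs")
print(f"D = {tokens_for_budget(c, 10e9):.2e} tokens")
```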

Worked Example 1: Computing Loss for a Model

Given:

  • Model size: N = 10 billion parameters
  • Data size: D = 100 billion tokens
  • Coefficients: a = 0.6, b = 0.8 (illustrative values, not fitted constants from the paper)

Find: The expected loss.

Step 1: Compute loss from model size

L_N = a * N^(-α)
L_N = 0.6 * (10B)^(-0.076)

Calculate:
10B = 10^10
log10(10^10) = 10
(10^10)^(-0.076) = 10^(-10 * 0.076) = 10^(-0.76)
10^(-0.76) ≈ 0.174  (using a calculator or log tables)

L_N = 0.6 * 0.174 ≈ 0.104

Step 2: Compute loss from data size

L_D = b * D^(-β)
L_D = 0.8 * (100B)^(-0.103)

Calculate:
100B = 10^11
log10(10^11) = 11
(10^11)^(-0.103) = 10^(-11 * 0.103) = 10^(-1.133)
10^(-1.133) ≈ 0.0736

L_D = 0.8 * 0.0736 ≈ 0.059
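
Steps 1 and 2 can be reproduced with a few lines of Python. This is a minimal sketch that follows the same route through base-10 logarithms; `power_law_loss` is a hypothetical helper name, and the coefficients are the ones given in this example.

```python
import math

ALPHA, BETA = 0.076, 0.103   # exponents quoted in this section
A, B = 0.6, 0.8              # coefficients given in the example


def power_law_loss(coeff: float, x: float, exponent: float) -> float:
    """Evaluate coeff * x^(-exponent) by way of base-10 logs, as in the steps above."""
    log10_x = math.log10(x)               # e.g. log10(10^10) = 10
    scale = 10 ** (-exponent * log10_x)   # e.g. 10^(-0.76)
    return coeff * scale


loss_n = power_law_loss(A, 10e9, ALPHA)    # N = 10B parameters
loss_d = power_law_loss(B, 100e9, BETA)    # D = 100B tokens
print(f"L_N = {loss_n:.3f}, L_D = {loss_d:.3f}")  # L_N = 0.104, L_D = 0.059
```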

Step 3: Combine losses

In practice, the two losses do not simply add. The paper fits a single combined form:

L(N, D) = [ (N_c / N)^(α / β) + D_c / D ]^β

where N_c and D_c are fitted constants. For simplicity, this example uses a cruder bottleneck approximation:
L_combined ≈ max(L_N, L_D)  (the loss is limited by whichever term is higher)

In this case:

L ≈ max(0.104, 0.059) = 0.104 (the paper measures loss in nats per token)

The model is bottlenecked by model size: L_N is the larger term, so adding parameters helps more than adding data.
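
The bottleneck rule used in this step is simple enough to state as code. The sketch below applies max(L_N, L_D) and reports which term is the larger one; the function name `combined_loss` is an assumption for illustration, and the rule is this section's simplification rather than the paper's fitted form.

```python
from typing import Tuple


def combined_loss(loss_n: float, loss_d: float) -> Tuple[float, str]:
    """Bottleneck approximation: the larger single-factor loss sets the total.

    Returns the combined loss and the factor that limits it.
    """
    if loss_n >= loss_d:
        return loss_n, "model size"
    return loss_d, "data size"


loss, bottleneck = combined_loss(0.104, 0.059)  # values from Steps 1 and 2
print(f"L ~ {loss:.3f}, limited by {bottleneck}")
```

The larger term is the bottleneck because lowering the smaller term leaves the maximum unchanged.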

Worked Example 2: Compute-Optimal Allocation

Given: Compute budget C = 10^21 FLOPs

Find: The optimal model size N and data size D to minimize loss.

Step 1: Relationship between C, N, and D

C = 6 * N * D
10^21 = 6 * N * D
N * D = (10^21) / 6 ≈ 1.67 * 10^20

Step 2: Optimal exponents

From the power laws:

α_N = 0.076 (exponent for N)
α_D = 0.103 (exponent for D)

The ratio is: α_D / α_N = 0.103 / 0.076 ≈ 1.355

Balancing the two terms of a simple additive loss a * N^(-α_N) + b * D^(-α_D) under C = 6 * N * D would give:

N ∝ C^(α_D / (α_N + α_D)) = C^(0.103 / 0.179) ≈ C^0.575

The paper's empirical fit puts more weight on model size than this naive balance, and it is the fit that this section uses:

N_opt ∝ C^(0.73)  (empirically found)
D_opt ∝ C^(0.27)  (empirically found)

So:
N_opt * D_opt ∝ C^(0.73 + 0.27) = C^1  (consistent with C ≈ 6 * N * D; the constant 6 is absorbed into the proportionality)
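
One way to read these exponents is to ask how N and D change when the budget grows. The sketch below does that for a 10x increase in compute; `growth_factors` is a hypothetical helper, and the exponents are the ones quoted above.

```python
from typing import Tuple

N_EXP, D_EXP = 0.73, 0.27  # compute-optimal exponents quoted above


def growth_factors(compute_multiplier: float) -> Tuple[float, float]:
    """How much N and D grow when the compute budget grows by compute_multiplier."""
    return compute_multiplier ** N_EXP, compute_multiplier ** D_EXP


n_growth, d_growth = growth_factors(10.0)
print(f"10x compute -> N grows {n_growth:.2f}x, D grows {d_growth:.2f}x")

# The product of the two factors recovers the compute multiplier,
# because the exponents sum to 1 and C is proportional to N * D.
print(f"product = {n_growth * d_growth:.2f}")
```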

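Step 3: Calculate the optimal values

Anchor the scaling at a reference point. The prefactor below (N_opt = 20B at C = 10^20) is illustrative, not a fitted constant:

N_opt = 20B * (C / 10^20)^0.73

For C = 10^21:
N_opt = 20B * (10^21 / 10^20)^0.73
N_opt = 20B * (10)^0.73
N_opt = 20B * 5.37 ≈ 107B parameters

D_opt then follows from the budget constraint in Step 1, which keeps the pair on budget:

D_opt = C / (6 * N_opt)
D_opt = 10^21 / (6 * 1.074 * 10^11) ≈ 1.55 * 10^9 tokens

This agrees with D_opt ∝ C^0.27: at C = 10^20 the same constraint gives D = 10^20 / (6 * 20B) ≈ 0.83B tokens, and 0.83B * (10)^0.27 = 0.83B * 1.86 ≈ 1.55B tokens.

Verify: 6 * N * D = 6 * (1.074 * 10^11) * (1.55 * 10^9) ≈ 1.0 * 10^21, which matches the budget.

So with 10^21 FLOPs of compute, under this illustrative prefactor:

  • Train a 107 billion parameter model
  • On about 1.6 billion tokens of data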
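
The allocation can be sketched in Python as follows. Two choices here are assumptions rather than the paper's method: the 20B prefactor at 10^20 FLOPs is the illustrative reference point from this example, and D is derived from the constraint C = 6 * N * D instead of from a separately chosen prefactor, so that the result always lands exactly on budget.

```python
from typing import Tuple

N_EXP = 0.73   # exponent for N_opt, as quoted in this section
N_REF = 20e9   # illustrative prefactor: N_opt at the reference budget
C_REF = 1e20   # reference budget in FLOPs


def compute_optimal_allocation(compute_flops: float) -> Tuple[float, float]:
    """Return (N_opt, D_opt) with D chosen so that 6 * N * D equals the budget."""
    n_opt = N_REF * (compute_flops / C_REF) ** N_EXP
    d_opt = compute_flops / (6.0 * n_opt)  # enforce C = 6 * N * D
    return n_opt, d_opt


n_opt, d_opt = compute_optimal_allocation(1e21)
print(f"N_opt ~ {n_opt:.3g} parameters, D_opt ~ {d_opt:.3g} tokens")
print(f"budget check: 6*N*D = {6 * n_opt * d_opt:.3g} FLOPs")
```

Since N_opt scales as C^0.73, the derived D_opt automatically scales as C^0.27, so the two exponents stay consistent with the budget.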

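(Note: GPT-3 used 175B parameters on 300B tokens. Later work by Hoffmann et al. (2022), which introduced the Chinchilla model, found that models of this kind are undertrained: for a fixed compute budget, parameters and training tokens should be scaled in roughly equal proportion, which gives a smaller model trained on more data than the exponents above suggest.)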

Worked Example 3: Comparing Two Strategies

Strategy A: N = 70B, D = 200B
Strategy B: N = 100B, D = 140B

Both use the same compute: C ≈ 6 * 70B * 200B = 8.4 * 10^22 FLOPs and C ≈ 6 * 100B * 140B = 8.4 * 10^22 FLOPs.

Which has better loss?

Strategy A Loss:

L_A ≈ max(0.6 * (70B)^(-0.076), 0.8 * (200B)^(-0.103))

L_N = 0.6 * (70B)^(-0.076)
     = 0.6 * (7 * 10^10)^(-0.076)
     = 0.6 * 10^(-10*0.076) * 7^(-0.076)
     ≈ 0.6 * 0.174 * 0.863 ≈ 0.090

L_D = 0.8 * (200B)^(-0.103)
     = 0.8 * (2 * 10^11)^(-0.103)
     = 0.8 * 10^(-11*0.103) * 2^(-0.103)
     ≈ 0.8 * 0.0736 * 0.931 ≈ 0.055

L_A ≈ max(0.090, 0.055) = 0.090

Strategy B Loss:

L_N = 0.6 * (100B)^(-0.076)
     = 0.6 * 10^(-11*0.076)
     = 0.6 * 10^(-0.836)
     ≈ 0.6 * 0.146 ≈ 0.088

L_D = 0.8 * (140B)^(-0.103)
     = 0.8 * (1.4 * 10^11)^(-0.103)
     = 0.8 * 10^(-11*0.103) * 1.4^(-0.103)
     ≈ 0.8 * 0.0736 * 0.966 ≈ 0.057

L_B ≈ max(0.088, 0.057) = 0.088

Comparison: L_B (0.088) < L_A (0.090)

Strategy B (more parameters, less data) is slightly better for the same compute, because:

  • In both strategies L_N is larger than L_D, so under the max approximation the loss is set by model size
  • B has more parameters, which lowers its N term (0.088 vs. 0.090)
  • This agrees with the compute-optimal exponents: 0.73 / 0.27 ≈ 2.7, so the exponent on N is about 2.7x the exponent on D
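
The comparison can be checked numerically with a short script. This is a sketch under the same assumptions as the worked examples: illustrative coefficients a = 0.6 and b = 0.8, and the max(L_N, L_D) bottleneck approximation. The helper `strategy_loss` is a hypothetical name.

```python
ALPHA, BETA = 0.076, 0.103
A, B = 0.6, 0.8  # illustrative coefficients from the worked examples


def strategy_loss(n_params: float, n_tokens: float) -> float:
    """Bottleneck approximation max(L_N, L_D) used in the worked examples."""
    loss_n = A * n_params ** (-ALPHA)
    loss_d = B * n_tokens ** (-BETA)
    return max(loss_n, loss_d)


strategies = {"A": (70e9, 200e9), "B": (100e9, 140e9)}
for name, (n, d) in strategies.items():
    flops = 6 * n * d  # both strategies spend the same compute
    print(f"{name}: C = {flops:.2e} FLOPs, L ~ {strategy_loss(n, d):.3f}")

# Pick the strategy with the lower approximate loss.
best = min(strategies, key=lambda s: strategy_loss(*strategies[s]))
print(f"lower loss: strategy {best}")
```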

Key Equations Summary

Power Laws:
L(N) = a * N^(-0.076)
L(D) = b * D^(-0.103)
L(C) = c * C^(-0.050)

Compute budget:
C = 6 * N * D

Compute-optimal frontier:
N_opt ∝ C^0.73
D_opt ∝ C^0.27

Log-linear form (for plotting on log-log axes):
log(L) = log(a) - α * log(N)
(This is a line with slope -α on log-log axes.)
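
The log-linear form is what makes the exponent easy to estimate: fit a straight line in log-log space and read off the slope. The sketch below does this with a plain least-squares fit in pure Python. The data points are synthetic and noise-free, generated from the law itself with the illustrative coefficient a = 0.6; they are not measurements. `fit_power_law` is a hypothetical helper name.

```python
import math


def fit_power_law(xs, ys):
    """Least-squares line through (log10 x, log10 y); returns (coefficient, exponent).

    For y = a * x^(-alpha), the fitted slope is -alpha and the intercept is log10(a).
    """
    lx = [math.log10(x) for x in xs]
    ly = [math.log10(y) for y in ys]
    n = len(lx)
    mean_x, mean_y = sum(lx) / n, sum(ly) / n
    sxx = sum((x - mean_x) ** 2 for x in lx)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(lx, ly))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return 10 ** intercept, -slope


# Synthetic, noise-free points generated from the law itself (not measured data).
sizes = [1e6, 1e7, 1e8, 1e9, 1e10]
losses = [0.6 * n ** (-0.076) for n in sizes]
a_hat, alpha_hat = fit_power_law(sizes, losses)
print(f"a ~ {a_hat:.3f}, alpha ~ {alpha_hat:.3f}")
```

With real training runs the points would scatter around the line, and the fitted slope would carry some uncertainty.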

Key Takeaways from This Section

  • Power laws: Three equations relating loss to N, D, and C.
  • Exponents: 0.076 for N, 0.103 for D, 0.050 for C.
  • Compute-optimal: N_opt ∝ C^0.73 and D_opt ∝ C^0.27. These are exponents, not percentage shares: as compute grows, model size should grow much faster than data.
  • Worked examples: Calculating loss and optimal allocations with concrete (illustrative) numbers.
  • The key insight: These relationships are smooth and predictable, enabling planning.

Next: Section 05: Worked Example