Section 05

Scaling Laws for Neural Language Models 2020

Worked Example: Verifying Scaling Laws with Real Numbers

Let’s trace through a simplified version of what the researchers did: train models at different scales and verify the power-law relationships.

Setup

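We’ll train four language models with increasing size and data. Training compute is estimated with the standard approximation C ≈ 6 * N * D, where N is the parameter count and D is the number of training tokens: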

Experiment 1 (Small):
  Model: 10 million parameters
  Data: 1 billion tokens
  Compute: 6 * 10^7 * 10^9 = 6 * 10^16 FLOPs

Experiment 2 (Medium):
  Model: 100 million parameters
  Data: 10 billion tokens
  Compute: 6 * 10^8 * 10^10 = 6 * 10^18 FLOPs

Experiment 3 (Large):
  Model: 1 billion parameters
  Data: 100 billion tokens
  Compute: 6 * 10^9 * 10^11 = 6 * 10^20 FLOPs

Experiment 4 (Very Large):
  Model: 10 billion parameters
  Data: 1 trillion tokens
  Compute: 6 * 10^10 * 10^12 = 6 * 10^22 FLOPs
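
The compute figures above can be reproduced with a few lines of Python. This is a minimal sketch that assumes the C ≈ 6 * N * D approximation used in the setup; the function name `training_flops` and the `experiments` list are illustrative names, not something from the paper.

```python
def training_flops(n_params, n_tokens):
    """Approximate training compute in FLOPs using C ~ 6 * N * D."""
    return 6.0 * n_params * n_tokens

# (name, parameters N, tokens D) for the four experiments in the setup
experiments = [
    ("Small", 1e7, 1e9),
    ("Medium", 1e8, 1e10),
    ("Large", 1e9, 1e11),
    ("Very Large", 1e10, 1e12),
]

for name, n, d in experiments:
    print(f"{name:<10} N={n:.0e}  D={d:.0e}  C={training_flops(n, d):.0e} FLOPs")
```

Because N and D both grow 10x per step, compute grows 100x per step.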

Simulated Results

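The researchers would train these models and measure cross-entropy loss. For this example, we’ll instead generate illustrative loss values from the power-law formula. The exponent 0.076 is the paper’s model-size exponent; the coefficient 0.5 is a toy value chosen for easy arithmetic, so only the trend is meaningful, not the absolute loss: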

Using: L(N) = 0.5 * N^(-0.076)

Experiment 1:
  L_1 = 0.5 * (10^7)^(-0.076)
      = 0.5 * 10^(-7 * 0.076)
      = 0.5 * 10^(-0.532)
      = 0.5 * 0.293 ≈ 0.147 bits per token

Experiment 2:
  L_2 = 0.5 * (10^8)^(-0.076)
      = 0.5 * 10^(-0.608)
      = 0.5 * 0.247 ≈ 0.123 bits per token

Experiment 3:
  L_3 = 0.5 * (10^9)^(-0.076)
      = 0.5 * 10^(-0.684)
      = 0.5 * 0.207 ≈ 0.104 bits per token

Experiment 4:
  L_4 = 0.5 * (10^10)^(-0.076)
      = 0.5 * 10^(-0.76)
      = 0.5 * 0.174 ≈ 0.087 bits per token
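
The same four values can be generated programmatically. This sketch assumes the toy formula L(N) = 0.5 * N^(-0.076) from this section; the names `A_N`, `ALPHA_N`, and `loss_from_params` are made up for the illustration.

```python
A_N = 0.5        # toy coefficient used in this example
ALPHA_N = 0.076  # model-size exponent used in this example

def loss_from_params(n_params):
    """Evaluate the toy power law L(N) = A_N * N^(-ALPHA_N)."""
    return A_N * n_params ** (-ALPHA_N)

for n in (1e7, 1e8, 1e9, 1e10):
    print(f"N={n:.0e}  L={loss_from_params(n):.3f}")
```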

Tabulated Results

Exp   N (params)   D (tokens)   Compute (FLOPs)   Loss (simulated)
1     10M          1B           6e16              0.147
2     100M         10B          6e18              0.123
3     1B           100B         6e20              0.104
4     10B          1T           6e22              0.087

Verifying the Power Law

Now, we’ll test whether these results fit the formula: L(N) = a * N^(-α)

Method: Log-Log Regression

Transform the data to log scale (base 10):

log10(N)        log10(Loss)
7               -0.833   (log10(0.147) ≈ -0.833)
8               -0.910   (log10(0.123) ≈ -0.910)
9               -0.983   (log10(0.104) ≈ -0.983)
10              -1.060   (log10(0.087) ≈ -1.060)

On a log-log plot, these should form a straight line that slopes downward, because loss falls as N grows:

log(Loss)
^
|  x
|    x
|      x
|        x
+---------> log(N)

Calculate the slope (the exponent α):

Using two points:

  • Point 1: (7, -0.833)
  • Point 4: (10, -1.060)
Slope = Δy / Δx = (-1.060 - (-0.833)) / (10 - 7)
      = -0.227 / 3
      = -0.0757 ≈ -0.076

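The slope is -0.076, which recovers the exponent α = 0.076 that we used to generate the data (the paper’s empirical model-size exponent). Because the losses came from the formula itself, this is a consistency check on the method rather than independent evidence. The negative sign indicates that loss decreases as N increases.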

Extract the Coefficient a

Using the formula L = a * N^(-0.076) and point 1:

0.147 = a * (10^7)^(-0.076)
0.147 = a * 0.293
a = 0.147 / 0.293 ≈ 0.502 ≈ 0.5

So the fitted equation is: L(N) = 0.5 * N^(-0.076)
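
The two-point slope can be cross-checked with an ordinary least-squares fit over all four points. This is a sketch in plain Python using the rounded losses from the table, so the recovered values are close to, but not exactly, 0.076 and 0.5. The variable names are illustrative.

```python
import math

# (N, loss) pairs, using the rounded losses from the table
data = [(1e7, 0.147), (1e8, 0.123), (1e9, 0.104), (1e10, 0.087)]

xs = [math.log10(n) for n, _ in data]
ys = [math.log10(loss) for _, loss in data]

# Ordinary least squares for y = slope * x + intercept
x_mean = sum(xs) / len(xs)
y_mean = sum(ys) / len(ys)
slope = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys)) / sum(
    (x - x_mean) ** 2 for x in xs
)
intercept = y_mean - slope * x_mean

alpha = -slope        # exponent in L = a * N^(-alpha)
a = 10 ** intercept   # coefficient, since log10(L) = log10(a) - alpha * log10(N)

print(f"alpha = {alpha:.4f}, a = {a:.3f}")
```

Using every point instead of just the endpoints makes the fit less sensitive to rounding or noise in any single measurement.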

Verification: Data Size Law

Similarly, using the same data with D (data size):

Experiment 1: D = 10^9,  L = 0.147
Experiment 2: D = 10^10, L = 0.123
Experiment 3: D = 10^11, L = 0.104
Experiment 4: D = 10^12, L = 0.087

Log-log transformation:

log(D)          log(Loss)
9               -0.833
10              -0.910
11              -0.983
12              -1.060

Slope:

Slope = (-1.060 - (-0.833)) / (12 - 9)
      = -0.227 / 3
      = -0.0757 ≈ -0.076

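This gives the same exponent as for N, but that is an artifact of the setup, not a measurement of the data exponent. The loss values were generated from L(N) alone, and N and D were scaled together (both increase by 10x at each step), so log(D) = log(N) + 2 and the regression line has the same slope.

To measure the data exponent, N and D must vary independently. For example, keep Experiment 1’s model but give it 10x more data:

Experiment 1 (more data): N = 10M,  D = 10B
  Using both laws: L = max(0.5 * (10^7)^(-0.076), 0.6 * (10^10)^(-0.103))
                     = max(0.147, 0.056)
                     = 0.147

Here the model-size term dominates, so the extra data does not lower the loss. In the opposite regime, where N is fixed and large enough that data is the bottleneck, varying D gives slope -0.103.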
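
The fixed-N case can be sketched in Python. The code assumes the toy combination used in this section (the larger of the two power-law terms, with coefficients 0.5 and 0.6). The fixed model size of 10^13 parameters and the token counts are hypothetical values chosen only so that the data term is always the larger one.

```python
import math

def toy_loss(n_params, n_tokens):
    """Toy combination from the text: the larger of the two power-law terms."""
    model_term = 0.5 * n_params ** (-0.076)
    data_term = 0.6 * n_tokens ** (-0.103)
    return max(model_term, data_term)

# Hypothetical sweep: a very large fixed model, so data is always the bottleneck
N_FIXED = 1e13
token_counts = [1e6, 1e7, 1e8, 1e9]
losses = [toy_loss(N_FIXED, d) for d in token_counts]

# Slope between the first and last points on log-log axes
slope_d = (math.log10(losses[-1]) - math.log10(losses[0])) / (
    math.log10(token_counts[-1]) - math.log10(token_counts[0])
)
print(f"slope vs. D = {slope_d:.3f}")
```

With a small fixed model instead (for example N = 10^7), the model term dominates across this sweep and the measured slope against D would be zero, which is why the two variables have to be separated before fitting.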

Extrapolation

Now suppose GPT-3 scales up from Experiment 3 (1B params, 100B tokens):

Question: What loss do we expect for GPT-3 at 175B params and 300B tokens?

N_3 = 10^9 params,  L_3 = 0.104
N_GPT3 = 1.75 * 10^11 params

Using L = 0.5 * N^(-0.076):

L_GPT3 = 0.5 * (1.75 * 10^11)^(-0.076)
       = 0.5 * 10^(-11.243 * 0.076)
       = 0.5 * 10^(-0.855)
       = 0.5 * 0.139
       ≈ 0.070 bits per token

Compare to data scaling:

D_3 = 10^11 tokens,  L_3 = 0.104
D_GPT3 = 3 * 10^11 tokens

Using L = 0.6 * D^(-0.103):

L_GPT3 = 0.6 * (3 * 10^11)^(-0.103)
       = 0.6 * 10^(-11.477 * 0.103)
       = 0.6 * 10^(-1.182)
       = 0.6 * 0.066
       ≈ 0.040 bits per token

Combined: The loss is bottlenecked by whichever term is larger. For 175B parameters trained on 300B tokens:

L_GPT3 ≈ max(0.070, 0.040) = 0.070

(This is approximate; the paper combines the two terms in a single smooth formula L(N, D) rather than a hard max.)
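
The extrapolation can be written as a short script. It is a sketch under the same toy assumptions as the hand calculation: coefficients 0.5 and 0.6, exponents 0.076 and 0.103, and a hard max to combine the two terms. The function names are illustrative.

```python
def model_term(n_params):
    """Toy model-size law: 0.5 * N^(-0.076)."""
    return 0.5 * n_params ** (-0.076)

def data_term(n_tokens):
    """Toy data-size law: 0.6 * D^(-0.103)."""
    return 0.6 * n_tokens ** (-0.103)

N_GPT3 = 1.75e11  # parameters
D_GPT3 = 3e11     # tokens

l_n = model_term(N_GPT3)
l_d = data_term(D_GPT3)
predicted = max(l_n, l_d)  # hard-max approximation used in this section

print(f"model term = {l_n:.3f}, data term = {l_d:.3f}, combined = {predicted:.3f}")
```

In this toy setting the model term is the larger one, so the model size, not the dataset, sets the predicted loss.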

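Reality: GPT-3’s actual validation loss was around 1.73 bits per token (measured on a holdout set). Our simplified model gives ~0.070, more than an order of magnitude lower, so the absolute values are not comparable: the coefficient 0.5 was chosen for easy arithmetic, and real losses also depend on tokenization, data quality, and other factors. What carries over from the toy model is the relative trend, that is, how much the loss should drop when N grows by a given factor.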

Why This Verification Matters

  1. Confirms the power law: Real experiments show straight-line relationships between loss and scale on log-log axes.
  2. Enables extrapolation: Once you fit a line on log-log axes, you can predict large scales.
  3. Justifies GPT-3’s scale: The laws predict that 175B parameters would give meaningful improvements. No need for guessing.
  4. Guides future designs: As the compute budget C grows, scale parameters roughly as C^0.73 and data as C^0.27, so most of any extra compute goes into a larger model.
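
To make the allocation rule in item 4 concrete, the sketch below reads the 73/27 split as growth exponents, meaning model size scales roughly as C^0.73 and data as C^0.27. That reading is an assumption of this example, and `scale_up` is a hypothetical helper.

```python
def scale_up(compute_multiplier, param_exponent=0.73, data_exponent=0.27):
    """Return (parameter multiplier, data multiplier) for a compute multiplier.

    Assumes N grows as C^param_exponent and D as C^data_exponent.
    """
    return (compute_multiplier ** param_exponent,
            compute_multiplier ** data_exponent)

n_mult, d_mult = scale_up(100.0)
print(f"100x compute -> {n_mult:.1f}x parameters, {d_mult:.1f}x tokens")
```

The two multipliers multiply back to the compute multiplier because the exponents sum to 1, which is consistent with C ≈ 6 * N * D.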

Sources of Variation

In real experiments, you’d see scatter around the trend line due to:

  • Data quality: Higher-quality data tends to reach lower loss at the same scale.
  • Architecture details: Attention patterns and layer counts might matter slightly.
  • Optimization: Learning rate, batch size, training duration affect final loss.
  • Random seed: Different initializations give slightly different final losses.

The paper measured these variations and found them small relative to the scaling signal.
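
The effect of small scatter on the fitted exponent can be illustrated with simulated noise. Everything here is an assumption made for the sketch: the noise level of 0.01 in log10(loss), the Gaussian noise model, and the fixed seed. None of these numbers come from the paper.

```python
import math
import random

rng = random.Random(0)   # fixed seed so the sketch is repeatable
NOISE_STD = 0.01         # assumed scatter in log10(loss); illustrative only

log_n = [7.0, 8.0, 9.0, 10.0]
# Ideal log10(loss) from L = 0.5 * N^(-0.076), plus simulated Gaussian noise
log_loss = [math.log10(0.5) - 0.076 * x + rng.gauss(0.0, NOISE_STD) for x in log_n]

# Ordinary least squares slope on the noisy points
x_mean = sum(log_n) / len(log_n)
y_mean = sum(log_loss) / len(log_loss)
slope = sum((x - x_mean) * (y - y_mean) for x, y in zip(log_n, log_loss)) / sum(
    (x - x_mean) ** 2 for x in log_n
)
print(f"fitted exponent with noise: {-slope:.3f}")
```

When the scatter is small compared with the change in loss across three orders of magnitude in N, the fitted exponent stays close to 0.076.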


Key Takeaways from This Section

  • Power laws are empirical: Data fits L = a * N^(-α) very well.
  • Log-log linearity: Plotting on log-log axes reveals straight lines; slope is the exponent.
  • Extrapolation works: The fitted curve can be used to predict loss at larger scales, as long as the trend continues to hold.
  • Variations exist: But they’re small, so the laws are reliable for planning.
  • Justifies scale: These curves show why 175B parameters was a reasoned choice rather than a random guess.

Next: Section 06: The Code