The Code: Simulating Scaling Laws

We’ll simulate scaling laws and plot the results on log-log axes, where a power-law relationship appears as a straight line.

Code Example 1: Simulating and Plotting Loss vs. Model Size

import numpy as np
import matplotlib.pyplot as plt

# Parameters for the power law: L(N) = a * N^(-alpha)
a = 0.5           # coefficient (illustrative value, not fitted to real data)
alpha = 0.076     # exponent for model size (from the paper)

# Model sizes (parameters) to test
model_sizes = np.logspace(6, 11, 20)  # 1M to 100B parameters
losses = a * (model_sizes ** (-alpha))

# Create log-log plot
plt.figure(figsize=(10, 6))
plt.loglog(model_sizes, losses, 'o-', linewidth=2, markersize=8)
plt.xlabel('Model Size N (parameters)', fontsize=12)
plt.ylabel('Cross-Entropy Loss', fontsize=12)
plt.title('Scaling Law: Loss vs. Model Size (Log-Log Scale)', fontsize=14)
plt.grid(True, which='both', alpha=0.3)
plt.axvline(x=175e9, color='r', linestyle='--', label='GPT-3 (175B)', linewidth=2)
gpt3_loss = a * (175e9 ** (-alpha))  # predicted loss at GPT-3's size, from the same formula
plt.axhline(y=gpt3_loss, color='r', linestyle='--', alpha=0.5)
plt.legend()
plt.tight_layout()
plt.show()

# Print some values
print("Model Size N\t\t\tLoss")
print("-" * 50)
for n in [1e6, 1e8, 1e10, 1.75e11]:
    loss = a * (n ** (-alpha))
    print(f"{n:.2e} parameters\t\t{loss:.4f}")

Output visualization:

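The plot shows a straight line on log-log axes with slope -alpha, which is how a power law looks in log space.
At N=175B (GPT-3), the formula gives a loss of about 0.070. Because the coefficient a is illustrative, this number shows the trend and is not GPT-3’s actual loss.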
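Why the line is straight: taking logs of L(N) = a * N^(-alpha) gives log L = log a - alpha * log N, which is linear in log N with slope -alpha. The short check below is a sketch that reuses the illustrative a = 0.5 and alpha = 0.076 from Example 1; the helper names power_law_loss and loglog_slope are made up for this illustration.

```python
import math

a = 0.5        # illustrative coefficient, as in Example 1
alpha = 0.076  # exponent used in Example 1

def power_law_loss(n):
    """L(N) = a * N^(-alpha)."""
    return a * n ** (-alpha)

def loglog_slope(n_lo, n_hi):
    """Slope of log L against log N between two model sizes (hypothetical helper)."""
    rise = math.log(power_law_loss(n_hi)) - math.log(power_law_loss(n_lo))
    run = math.log(n_hi) - math.log(n_lo)
    return rise / run

slope_small = loglog_slope(1e6, 1e8)
slope_large = loglog_slope(1e9, 1.75e11)
print(f"slope between 1M and 100M params: {slope_small:.4f}")
print(f"slope between 1B and 175B params: {slope_large:.4f}")
# Both slopes equal -alpha, up to floating-point rounding

# Each 10x increase in N multiplies the loss by the same factor, 10^(-alpha)
factor_per_decade = 10 ** (-alpha)
print(f"loss multiplier per 10x parameters: {factor_per_decade:.3f}")
```

The constant multiplier per decade is another way to read the same slope: the relative improvement from a 10x larger model is the same at every scale.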

Code Example 2: Compute-Optimal Allocation

import numpy as np
import matplotlib.pyplot as plt

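# Power law exponents for the loss (listed for reference; not used in the formulas below)
alpha_N = 0.076   # exponent for model size
alpha_D = 0.103   # exponent for data size

# Compute budget C (in FLOPs)
compute_budgets = np.logspace(18, 22, 10)  # 10 budgets from 10^18 to 10^22 FLOPs

# Optimal allocation: N grows as C^0.73 and D grows as C^0.27.
# The 20e9 prefactors are illustrative anchors at C = 1e20, not fitted constants,
# and they are not calibrated to satisfy C = 6 * N * D.
N_optimal = 20e9 * (compute_budgets / 1e20) ** 0.73  # optimal param count
D_optimal = 20e9 * (compute_budgets / 1e20) ** 0.27  # optimal token count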

# Create figure with two subplots
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(14, 5))

# Plot 1: Optimal model and data size vs. compute
ax1.loglog(compute_budgets, N_optimal, 'o-', label='Optimal N', linewidth=2)
ax1.loglog(compute_budgets, D_optimal, 's-', label='Optimal D', linewidth=2)
ax1.set_xlabel('Compute Budget C (FLOPs)', fontsize=11)
ax1.set_ylabel('Size (parameters or tokens)', fontsize=11)
ax1.set_title('Compute-Optimal Allocation', fontsize=12)
ax1.legend()
ax1.grid(True, which='both', alpha=0.3)

# Plot 2: Ratio of N to D (grows as C^0.46 with the exponents above)
ratio = N_optimal / D_optimal
ax2.loglog(compute_budgets, ratio, 'g^-', linewidth=2, markersize=8)
ax2.axhline(y=1.0, color='r', linestyle='--', label='N = D', alpha=0.5)
ax2.set_xlabel('Compute Budget C (FLOPs)', fontsize=11)
ax2.set_ylabel('N / D (parameter-to-token ratio)', fontsize=11)
ax2.set_title('Optimal Parameter-to-Token Ratio', fontsize=12)
ax2.legend()  # needed for the reference-line label to appear
ax2.grid(True, which='both', alpha=0.3)

plt.tight_layout()
plt.show()

# Print table of optimal allocations
print("Compute Budget\t\tOptimal N\t\tOptimal D")
print("-" * 70)
for c, n, d in zip(compute_budgets, N_optimal, D_optimal):
    print(f"{c:.2e} FLOPs\t{n:.2e} params\t{d:.2e} tokens")

Output:

Compute Budget          Optimal N               Optimal D
────────────────────────────────────────────────────────────
1.00e+18 FLOPs          6.93e+08 params         5.77e+09 tokens
2.78e+18 FLOPs          1.46e+09 params         7.60e+09 tokens
...
1.00e+22 FLOPs          5.77e+11 params         6.93e+10 tokens

The ratio N/D is not constant: it grows as C^0.46 (0.73 - 0.27), from about 0.12 at 10^18 FLOPs to about 8.3 at 10^22 FLOPs. On log-log axes the ratio is itself a straight line, so with these exponents most additional compute goes to model size rather than data.
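The way the split shifts with budget can be worked out directly from the two formulas in Example 2: dividing them gives N/D proportional to C^(0.73 - 0.27) = C^0.46. The sketch below reuses Example 2’s illustrative prefactors (20e9 at C = 1e20, which are not fitted values); the function names are hypothetical.

```python
import math

def optimal_n(c):
    """Illustrative parameter count from Example 2: grows as C^0.73."""
    return 20e9 * (c / 1e20) ** 0.73

def optimal_d(c):
    """Illustrative token count from Example 2: grows as C^0.27."""
    return 20e9 * (c / 1e20) ** 0.27

budgets = [1e18, 1e20, 1e22]
ratios = [optimal_n(c) / optimal_d(c) for c in budgets]
for c, r in zip(budgets, ratios):
    print(f"C = {c:.0e} FLOPs -> N/D = {r:.3f}")

# Log-log slope of the ratio across the full range of budgets
ratio_exponent = (math.log(ratios[-1]) - math.log(ratios[0])) / (
    math.log(budgets[-1]) - math.log(budgets[0])
)
print(f"N/D grows as C^{ratio_exponent:.2f}")
```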

Code Example 3: Comparing Compute-Suboptimal vs. Optimal Allocations

# Joint loss model: L(N, D) = E + A / N^alpha + B / D^beta
# Constants are the parametric fit reported by Hoffmann et al. (2022, "Chinchilla").
# A joint model is needed here: the single-variable power laws from Examples 1 and 2
# cannot rank two allocations that differ in both N and D.
E = 1.69       # irreducible loss
A = 406.4      # coefficient for the model-size term
B = 410.7      # coefficient for the data-size term
alpha = 0.34   # exponent for model size
beta = 0.28    # exponent for data size

def compute_loss(N, D):
    """Estimate loss given model size N (parameters) and data size D (tokens)."""
    return E + A / N**alpha + B / D**beta

# Strategy 1: GPT-3 actual allocation
N1 = 175e9   # parameters
D1 = 300e9   # tokens
C1 = 6 * N1 * D1   # training compute, using C = 6 * N * D
loss1 = compute_loss(N1, D1)

# Strategy 2: Chinchilla-style allocation (fewer parameters, more tokens), same compute
N2 = 70e9            # parameters
D2 = C1 / (6 * N2)   # tokens that use exactly the same compute (750B)
C2 = 6 * N2 * D2
loss2 = compute_loss(N2, D2)

# Verify compute is the same
print("GPT-3 allocation:")
print(f"  N={N1:.2e}, D={D1:.2e}, C={C1:.2e} FLOPs")
print(f"  Loss={loss1:.4f}")

print("\nChinchilla-style allocation (same compute):")
print(f"  N={N2:.2e}, D={D2:.2e}, C={C2:.2e} FLOPs")
print(f"  Loss={loss2:.4f}")

print("\nImprovement from the Chinchilla-style allocation:")
print(f"  Loss reduction: {(loss1 - loss2):.4f} ({100*(loss1-loss2)/loss1:.1f}%)")
print("  GPT-3 was compute-suboptimal: more params and fewer tokens than this model favors.")

Output:

GPT-3 allocation:
  N=1.75e+11, D=3.00e+11, C=3.15e+23 FLOPs
  Loss=2.0023

Chinchilla-style allocation (same compute):
  N=7.00e+10, D=7.50e+11, C=3.15e+23 FLOPs
  Loss=1.9678

Improvement from the Chinchilla-style allocation:
  Loss reduction: 0.0345 (1.7%)
  GPT-3 was compute-suboptimal: more params and fewer tokens than this model favors.

Under this fitted loss model, GPT-3, while impressive, could have reached a lower loss with the same compute by using fewer parameters and more data. Chinchilla (2022) put this into practice with a 70B-parameter model trained on about 1.4T tokens.
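One way to look for the best split at a fixed budget is to sweep the model size, set D = C / (6N), and keep the N with the lowest predicted loss. The sketch below assumes the parametric form L(N, D) = E + A/N^alpha + B/D^beta with the constants reported by Hoffmann et al. (2022). Treat those constants, the sweep range, and the helper name parametric_loss as assumptions for illustration; the answer is only as good as the fit behind it.

```python
import numpy as np

# Assumed constants: parametric fit reported by Hoffmann et al. (2022)
E, A, B = 1.69, 406.4, 410.7
alpha, beta = 0.34, 0.28

def parametric_loss(n, d):
    """L(N, D) = E + A / N^alpha + B / D^beta (hypothetical helper)."""
    return E + A / n**alpha + B / d**beta

C = 6 * 175e9 * 300e9  # GPT-3's approximate training compute, via C = 6 * N * D

# Sweep candidate model sizes; the token count is whatever the budget leaves over
candidate_n = np.logspace(9, 12, 601)   # 1B to 1T parameters (assumed range)
candidate_d = C / (6 * candidate_n)
losses = parametric_loss(candidate_n, candidate_d)

best = np.argmin(losses)
best_n, best_d, best_loss = candidate_n[best], candidate_d[best], losses[best]
gpt3_loss = parametric_loss(175e9, 300e9)

print(f"Best N in sweep: {best_n:.2e} parameters")
print(f"Matching D:      {best_d:.2e} tokens")
print(f"Predicted loss:  {best_loss:.4f} (GPT-3 split: {gpt3_loss:.4f})")
```

A brute-force sweep is used instead of a closed-form optimum so that the loss function can be swapped for a different fit without redoing any algebra.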

Code Explanation

Line-by-line breakdown of Example 1:

import numpy as np
# NumPy for numerical operations (powers, exponentials)

import matplotlib.pyplot as plt
# Matplotlib for plotting

a = 0.5
alpha = 0.076
# Define the power law parameters: alpha is the exponent from the paper,
# a is an illustrative coefficient

model_sizes = np.logspace(6, 11, 20)
# Create 20 model sizes evenly spaced in log space
# from 10^6 (1M) to 10^11 (100B) parameters

losses = a * (model_sizes ** (-alpha))
# Compute loss for each model size using L(N) = a * N^(-alpha)

plt.loglog(...)
# Plot on log-log axes (both axes are logarithmic)
# This makes power laws appear as straight lines

plt.xlabel('Model Size N (parameters)', fontsize=12)
# Label the x-axis

plt.axvline(x=175e9, color='r', linestyle='--', label='GPT-3')
# Add a vertical line at N=175B to mark GPT-3's size

plt.grid(True, which='both', alpha=0.3)
# Add grid lines for easier reading on log scale

Running on Google Colab

Open https://colab.research.google.com
Paste Example 1, 2, or 3 into a cell
Run the cell
View the plots

All three examples run in a few seconds on Colab’s free tier. They only do small NumPy and Matplotlib computations, so no GPU is needed.

Key Insights from the Code

Log-log linearity: When you plot L vs. N on log-log axes, you get a straight line with slope -α. That straight line is the signature of a power law.
Extrapolation: Once you fit a line to small-scale results, you can extend it to predict loss at larger scales, as long as the trend continues to hold.
Compute-optimal frontier: For a fixed compute budget, there’s a trade-off between model size and data size. The ratio depends on the loss exponents (0.076 vs. 0.103).
GPT-3’s suboptimality: Example 3 suggests GPT-3 could have been trained on more data with fewer parameters for the same compute, achieving lower loss.
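The extrapolation step can be sketched as a straight-line fit in log space. The example below generates synthetic losses from Example 1’s power law with a little multiplicative noise, fits log L against log N with np.polyfit, and extends the fitted line to 175B parameters. The noise level, the random seed, and the choice to fit only models up to 1B parameters are assumptions for illustration, not measurements.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the run is repeatable

# Synthetic "measurements": small models only, 1M to 1B parameters
true_a, true_alpha = 0.5, 0.076
small_sizes = np.logspace(6, 9, 12)
noise = rng.normal(loc=0.0, scale=0.01, size=small_sizes.size)  # about 1% noise
measured = true_a * small_sizes ** (-true_alpha) * np.exp(noise)

# Fit a line: log L = log a - alpha * log N
slope, intercept = np.polyfit(np.log(small_sizes), np.log(measured), deg=1)
fit_alpha = -slope
fit_a = np.exp(intercept)

# Extend the fitted line to a much larger model
target_n = 175e9
predicted = fit_a * target_n ** (-fit_alpha)
actual = true_a * target_n ** (-true_alpha)

print(f"fitted alpha = {fit_alpha:.4f} (true {true_alpha})")
print(f"fitted a     = {fit_a:.4f} (true {true_a})")
print(f"predicted loss at 175B = {predicted:.4f}, noiseless value = {actual:.4f}")
```

Small errors in the fitted slope are amplified the further the line is extended, which is why the extrapolated value is compared against the noiseless one here.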

Key Takeaways from This Section

Power laws plot as straight lines on log-log axes.
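Numerical verification: You can fit the parameters (a and α) from data with a straight-line fit in log space.
Compute-optimal allocation: With the exponents in Example 2, optimal model size grows as C^0.73 and optimal data as C^0.27, so roughly 70% of each increase in compute (in log terms) goes to parameters. Chinchilla (2022) later revised this toward scaling both about equally.
Practical tool: These equations let you estimate the outcome of large experiments before running them.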
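As a planning aid, the approximation C = 6 * N * D used in Example 3 can be turned around to answer "how many tokens can this budget afford at a given model size?". A minimal sketch: the function names are hypothetical, and the factor of 6 is the same rule of thumb used above, not an exact FLOP count.

```python
def training_flops(n_params, n_tokens):
    """Approximate training compute, C = 6 * N * D."""
    return 6 * n_params * n_tokens

def tokens_for_budget(flops, n_params):
    """Tokens affordable at a given model size under C = 6 * N * D."""
    return flops / (6 * n_params)

budget = training_flops(175e9, 300e9)  # a GPT-3-scale budget
print(f"Budget: {budget:.2e} FLOPs")

for n in [10e9, 70e9, 175e9]:
    d = tokens_for_budget(budget, n)
    print(f"N = {n:.1e} params -> D = {d:.2e} tokens ({d / n:.1f} tokens per parameter)")
```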

Next: Section 07: Limitations