Let’s see test-time compute scaling in practice. Two quick Python experiments that run free on Google Colab.
Block 1: Best-of-N Accuracy Scaling
This shows the central insight of the paper: how accuracy grows with more attempts using the formula P(at least one correct in N attempts) = 1 - (1-p)^N.
import numpy as np
import matplotlib.pyplot as plt
# Probability of success on a single attempt for different problem difficulties
p_values = [0.1, 0.3, 0.5] # hard, medium, easy problems
n_range = np.arange(1, 51) # number of attempts (1 to 50)
plt.figure(figsize=(8, 5))
for p in p_values:
# P(at least one correct in N attempts) = 1 - (1-p)^N
success_prob = 1 - (1 - p) ** n_range
plt.plot(n_range, success_prob, label=f"p = {p} per attempt", linewidth=2)
plt.xlabel("Number of attempts (N)", fontsize=12)
plt.ylabel("P(at least one correct)", fontsize=12)
plt.title("Best-of-N: How accuracy improves with more attempts", fontsize=13, fontweight='bold')
plt.legend(fontsize=11)
plt.grid(True, alpha=0.3)
plt.axhline(0.95, color='red', linestyle='--', linewidth=1.5, label='95% target accuracy')
plt.tight_layout()
plt.savefig("best_of_n.png", dpi=100)
plt.show()
# For p=0.1 (hard problem): how many attempts to reach 95% accuracy?
p = 0.1
n_needed = np.ceil(np.log(0.05) / np.log(1 - p)) # solve 1-(1-p)^N >= 0.95
print(f"Attempts needed for 95% accuracy (p=0.1): {int(n_needed)}")
print(f"Verify: 1 - (1-0.1)^{int(n_needed)} = {1 - (1-0.1)**int(n_needed):.4f}")
What you see: The curves shoot upward most steeply at the beginning. For p=0.1 (hard problems), you need about 29 attempts to guarantee 95% accuracy. For p=0.5 (easier problems), just 5 attempts. This is why the paper says “spend more compute where you need it” — hard problems benefit dramatically from more attempts, but easy ones hit diminishing returns fast.
The maths: When p is small, the benefit of each new attempt is largest. Adding attempt N=2 to N=1 when p=0.1 improves accuracy from 10% to 19% — a 9 percentage point jump. But adding attempt N=50 to N=49 only improves from 94.9% to 94.99% — barely 0.1 percentage points. This is characteristic of exponential curves with base < 1.
Block 2: Sequential Revision Simulation
Now we compare sequential revision, where the model refines its answer iteratively. Each round improves the success probability by a fixed amount.
import numpy as np
def simulate_sequential_revision(p_initial, p_increment, n_rounds, n_trials=5000):
"""
Simulate sequential revision where each revision round
increases probability of success by p_increment.
Returns fraction of trials that eventually succeed.
"""
successes = 0
for _ in range(n_trials):
p = p_initial # start with initial success probability
solved = False
for round_num in range(n_rounds):
if np.random.random() < p: # did this round succeed?
solved = True
break
p = min(1.0, p + p_increment) # learning improves each round
if solved:
successes += 1
return successes / n_trials
# Compare strategies on a hard problem (p_initial = 0.15)
print("=" * 60)
print("Sequential revision accuracy (p_initial=0.15, +0.05 per round):")
print("=" * 60)
for rounds in [1, 3, 5, 8, 10]:
acc = simulate_sequential_revision(0.15, 0.05, rounds)
print(f" {rounds:2d} rounds: {acc:.3f}")
# Show compute-optimal crossover: for what problem difficulty should you
# use Best-of-N vs. sequential revision?
p_initial = 0.2
p_increment = 0.06
n = 5 # 5 attempts or 5 rounds (same token budget)
bon_acc = 1 - (1 - p_initial) ** n # best-of-N
seq_acc = simulate_sequential_revision(p_initial, p_increment, n)
print("\n" + "=" * 60)
print(f"Same compute budget (N=5), hard problem (p=0.2):")
print("=" * 60)
print(f" Best-of-N accuracy: {bon_acc:.3f}")
print(f" Sequential revision acc: {seq_acc:.3f}")
if bon_acc > seq_acc:
print(f" → Best-of-N wins by {bon_acc - seq_acc:.3f}")
else:
print(f" → Sequential revision wins by {seq_acc - bon_acc:.3f}")
What you see: With p_initial=0.15, even after 10 rounds, sequential revision only reaches about 56% accuracy. But Best-of-N with the same 5 attempts reaches 1 - 0.8^5 = 67.2% — a clear win. This matches the paper’s finding: Best-of-N is better for problems where the model’s reasoning is fragile (if you get the answer, it’s by luck, and different attempts have independent outcomes). Sequential revision is better when the model can iteratively improve (the second attempt learns from the first attempt’s feedback).
When to use which: The paper shows that the optimal strategy depends on problem difficulty and whether feedback helps. For MATH (reasoning that compounds), Best-of-N with a good verifier is surprisingly powerful.
Key Takeaways
-
Best-of-N scales as 1 - (1-p)^N — a simple formula with big implications. Even a 10% base success rate becomes 90%+ accuracy with 44 attempts.
-
Diminishing returns are real — each additional attempt helps less than the last. You don’t need 1000 attempts to hit 99% accuracy; often 20–50 is enough.
-
The optimal strategy changes with problem type — but in the wild, Best-of-N with a Process Reward Model to select the best solution has proven to be the real workhorse.
-
Latency matters in practice — the formula says 29 attempts, but in a production system where users wait, 5–10 is often the sweet spot.