Prerequisite Tutorials
Best-of-N Coverage Probability
The key formula for best-of-N: given N independent attempts, each correct with probability p, what is the probability that at least one is correct?
P(at least one correct | N attempts) = 1 - P(all incorrect)
= 1 - (1 - p)^N
This is the coverage — the probability the verifier sees at least one correct solution among the N tries.
Worked Example 1: MATH Benchmark
Model: GPT-4 on MATH. Single-attempt accuracy: p = 0.42
Suppose we generate N = 10 independent solutions and use a perfect verifier to check them:
P(at least one correct) = 1 - (1 - 0.42)^10
= 1 - (0.58)^10
= 1 - 0.00324
= 0.99676 ≈ 99.7%
So 10 attempts with a perfect verifier turns a 42% model into a 99.7% model.
But this assumes a perfect verifier. In reality, the verifier also makes mistakes.
Coverage with Imperfect Verifier
If the verifier has accuracy v (chance of correctly identifying which solution is right), then:
P(solve | N attempts, verifier accuracy v)
= P(at least one correct AND verifier picks it)
≈ 1 - (1-p)^N × v + correction terms
Simplified: even with a slightly imperfect verifier (v = 0.95), the coverage drops. For N = 10, p = 0.42:
Perfect verifier: P ≈ 0.997
v = 0.95 verifier: P ≈ 0.997 × 0.95 ≈ 0.947
This is still excellent, but shows the verifier quality matters.
Sequential Revision Model
For sequential revision, we need a model of how the solution improves with each round.
Simple linear improvement model: After round i, the success probability is:
p_i = p_0 + i × Δp
Where:
- p_0 = initial success probability (first attempt)
- Δp = improvement per revision round
- Maximum p_i ≤ 1.0 (can’t exceed 100%)
Worked Example 2: Sequential Revision
Model: 3.8B LLaMA on MATH. Initial success probability: p_0 = 0.2 (20%)
After each round of critique + revision, the model improves by Δp = 0.05 (5 percentage points).
Round-by-round success probability:
Round 1: p_1 = 0.2 + 0 × 0.05 = 0.2
Round 2: p_2 = 0.2 + 1 × 0.05 = 0.25
Round 3: p_3 = 0.2 + 2 × 0.05 = 0.30
Round 4: p_4 = 0.2 + 3 × 0.05 = 0.35
Round 5: p_5 = 0.2 + 4 × 0.05 = 0.40
Round 10: p_10 = 0.2 + 9 × 0.05 = 0.65
By round 10, the 20% model becomes a 65% model.
Compute Budget Trade-off
Given a fixed token budget B at test time, how do you allocate it between best-of-N and sequential revision?
Best-of-N: Each solution requires approximately K tokens (length of a typical solution). To generate N solutions: B = N × K, so N = B / K.
Sequential revision: Initial solution is K tokens. Each revision round adds roughly K tokens (critique + revised solution). For R rounds: B ≈ K × (1 + R), so R = (B / K) - 1.
Worked Example 3: Compute Budget Allocation
Budget: B = 10,000 tokens. Solution length: K = 1,000 tokens.
Option A (Best-of-N): N = 10,000 / 1,000 = 10 tries
- Coverage: 1 - (0.42)^10 ≈ 0.9968 (99.68%)
Option B (Sequential Revision): R = (10,000 / 1,000) - 1 = 9 rounds
- Final accuracy: p_0 + 9 × Δp = 0.2 + 9 × 0.05 = 0.65 (65%)
For this problem:
- Best-of-N wins (99.68% vs 65%)
Now change the budget to B = 50,000 tokens:
Option A (Best-of-N): N = 50 tries
- Coverage: 1 - (0.42)^50 ≈ 0.99999 (essentially 100%)
Option B (Sequential Revision): R = 49 rounds
- Final accuracy: 0.2 + 49 × 0.05 = 2.65 (capped at 1.0 = 100%)
The model saturates — it can’t improve beyond 100% accuracy. But in this regime, both strategies reach near-perfect accuracy with the same budget.
Cost-Benefit Analysis: When Sequential Revision Wins
For a hard problem with p_0 = 0.1 (10%), Δp = 0.05:
Best-of-N Analysis:
N = 1: P = 0.1
N = 2: P = 1 - 0.9^2 = 0.19
N = 5: P = 1 - 0.9^5 = 0.409
N = 10: P = 1 - 0.9^10 = 0.652
N = 20: P = 1 - 0.9^20 = 0.878
N = 50: P = 1 - 0.9^50 = 0.995
Sequential Revision Analysis:
Round 1: P = 0.1
Round 2: P = 0.15
Round 5: P = 0.30
Round 10: P = 0.55
Round 20: P = 1.0 (saturates)
To reach 65% accuracy:
- Best-of-N: need N ≈ 10 tries
- Sequential Revision: need R ≈ 10 rounds
Same number of rounds! But the compute is different:
- Best-of-N: 10 forward passes, each generating a full solution
- Sequential Revision: 10 critique + revision rounds (might be cheaper per round if critique is shorter)
The paper’s insight: sequential revision scales better for hard problems because each refinement directly targets the weakness, whereas best-of-N is sampling randomly.
Key Equations Summary
Best-of-N coverage:
P_N = 1 - (1 - p)^N
Sequential revision (linear model):
p_r = p_0 + r × Δp (capped at 1.0)
Compute budget constraint:
B = N × K (best-of-N)
B ≈ K × (1 + R) (sequential revision)
Optimal strategy:
If (1 - p)^N > p_0 + r × Δp for same budget → use best-of-N
If p_0 + r × Δp > (1 - (1-p)^N) for same budget → use sequential revision
The paper shows empirically that for MATH and hard reasoning tasks, sequential revision (and variants like search) is compute-optimal.