Worked Example: Test-Time Compute on a Math Problem — Scaling LLM Test-Time Compute Optimally Can be More Effective than Scaling Model Parameters

Let’s walk through both strategies (best-of-N and sequential revision) on a single math problem.

Problem

“Solve for x: 2x² + 5x - 3 = 0”

Scenario Setup

Model: 3.8B LLaMA model
Single-attempt success rate on quadratic equations: p = 0.3 (30%)
Available test-time tokens: 5,000 tokens (about 5 typical math solutions)

Strategy 1: Best-of-N

Allocation: Generate 5 independent solutions.

Attempt 1 (Model sampling 1):

Let me use the quadratic formula: x = (-b ± √(b² - 4ac)) / 2a
For 2x² + 5x - 3 = 0:
a = 2, b = 5, c = -3
Discriminant: 5² - 4(2)(-3) = 25 + 24 = 49
√49 = 7
x = (-5 ± 7) / 4
x = 2/4 = 0.5 OR x = -12/4 = -3
Answer: x = 0.5 or x = -3

Verification: Plug x = 0.5 into 2x² + 5x - 3: 2(0.5)² + 5(0.5) - 3 = 2(0.25) + 2.5 - 3 = 0.5 + 2.5 - 3 = 0 ✓

Plug x = -3 into 2x² + 5x - 3: 2(-3)² + 5(-3) - 3 = 2(9) - 15 - 3 = 18 - 15 - 3 = 0 ✓

Score: 1.0 (correct)

Attempt 2 (Model sampling 2):

I'll try to factor: 2x² + 5x - 3 = (2x - 1)(x + 3) = 0
Wait, let me verify: (2x - 1)(x + 3) = 2x² + 6x - x - 3 = 2x² + 5x - 3 ✓
So 2x - 1 = 0 → x = 1/2
Or x + 3 = 0 → x = -3
Answer: x = 1/2 or x = -3

Verification: Both solutions check out.

Score: 1.0 (correct)

Attempt 3 (Model sampling 3):

Using quadratic formula:
x = (-5 ± √(25 - 4·2·(-3))) / (2·2)
x = (-5 ± √(25 + 24)) / 4
x = (-5 ± √49) / 4
x = (-5 ± 7) / 4
x₁ = 2/4 = 1/2, x₂ = -12/4 = -3
Answer: x = 1/2 or x = -3

Score: 1.0 (correct)

Attempt 4 (Model sampling 4):

Let me rearrange: 2x² + 5x - 3 = 0
I'll use completing the square... actually this is hard.
Using quadratic formula: x = -5 ± √(25 + 24) / 4 = -5 ± 7 / 4
x = 2/4 or x = -12/4
x = 0.5 or x = -3

Wait, I think I made an error somewhere. Let me double-check...
Actually those are right.
Answer: x = 0.5 or x = -3

Score: 1.0 (correct, despite the uncertainty)

Attempt 5 (Model sampling 5):

2x² + 5x - 3 = 0
Let me think... maybe x = 1?
2(1)² + 5(1) - 3 = 2 + 5 - 3 = 4 ≠ 0
Try x = 0?
0 + 0 - 3 = -3 ≠ 0
Let me use the quadratic formula...
Actually, I'll just guess: x = 2 and x = -3
Let me check x = 2: 2(4) + 5(2) - 3 = 8 + 10 - 3 = 15 ≠ 0
So that's wrong.
Answer: I'm not sure, my guesses didn't work.

Score: 0.0 (incorrect/incomplete)

Best-of-N Decision:

The verifier sees 5 solutions:

x = 0.5, -3 (Correct) ✓
x = 0.5, -3 (Correct) ✓
x = 0.5, -3 (Correct) ✓
x = 0.5, -3 (Correct) ✓
Incorrect/unsure (Wrong) ✗

The verifier checks which solutions satisfy the original equation:

Solutions 1–4 all pass verification
Solution 5 fails verification

The verifier picks solution 1 (or any of 1–4, they’re the same).

Final Answer: x = 0.5 or x = -3

Success: Yes ✓

Coverage: With 5 attempts, 4 correct, P(success) = 1 (at least one is correct)

Strategy 2: Sequential Revision

Allocation: Generate one solution, then iterate 4 times (critique + revise).

Round 1 (Initial Solution):

Question: Solve 2x² + 5x - 3 = 0
I'll use the quadratic formula: x = (-b ± √(b² - 4ac)) / 2a
a = 2, b = 5, c = -3
Discriminant = 25 + 24 = 49
√49 = 7
x = (-5 + 7) / 4 = 2/4 = 0.5
x = (-5 - 7) / 4 = -12/4 = -3
Final Answer: x = 0.5 or x = -3

Verifier Check: Both solutions are correct ✓

Critique: “Your solution is correct. You properly identified a = 2, b = 5, c = -3, calculated the discriminant correctly, and applied the quadratic formula properly. No revisions needed.”

Round 2 (Revision):

The model reads the critique:

The verifier says my solution is already correct. Let me double-check one more time:
- Check x = 0.5: 2(0.5)² + 5(0.5) - 3 = 0.5 + 2.5 - 3 = 0 ✓
- Check x = -3: 2(-3)² + 5(-3) - 3 = 18 - 15 - 3 = 0 ✓
Yes, both are correct.

Solution after round 2: x = 0.5 or x = -3 (unchanged, already optimal)

Rounds 3–4:

Since the solution is already perfect (both roots are correct), revisions don’t improve it further. The model converges to the correct answer by round 2.

Success: Yes ✓

Coverage: After 2 rounds, P(success) = 1.0 (found correct answer)

Comparison

Metric	Best-of-N	Sequential Revision
Number of forward passes	5	4 (1 initial + 3 more)
Tokens used	~5,000	~4,000
Final success	Yes (1.0)	Yes (1.0)
Confidence	High (4/5 attempts correct)	High (critique confirmed correctness)

For this problem, both strategies succeed. But the sequential revision is slightly more efficient (4 rounds vs. 5 attempts).

Hard Problem Comparison: Geometry

If the problem were harder (e.g., a complex geometry problem where p = 0.2 per attempt):

Best-of-N with 5 attempts:

P(at least one correct) = 1 - (0.8)^5 = 1 - 0.328 = 0.672 (67.2%)

Sequential Revision with 5 rounds, Δp = 0.08:

Round 1: p = 0.2
Round 2: p = 0.28
Round 3: p = 0.36
Round 4: p = 0.44
Round 5: p = 0.52 (52%)

Best-of-N wins (67% vs 52%), so for easier problems (relative to the difficulty), best-of-N is better.

But if we had more tokens (say, 10 rounds for sequential revision):

Round 10: p = 0.2 + 9 × 0.08 = 0.92 (92%)

Sequential revision now wins (92% vs 67%).

The paper’s insight: The optimal strategy depends on task difficulty and available compute. Hard problems + deep thinking beats easy problems + fast sampling.