The Math: Few-Shot Setup and Emergent Reasoning
This paper is fundamentally empirical rather than mathematical — it discovers a phenomenon through experiments, not through deriving new equations. However, we should formalize the setup and understand what’s being measured.
Prerequisites: Cross-Entropy Loss
Few-Shot Prompting Formally
A language model is a function that predicts the next token given a sequence of previous tokens:
$$P(y_t | y_1, y_2, \ldots, y_{t-1}) = \text{softmax}(\text{logits}_t)$$
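To make the softmax step concrete, here is a minimal sketch of turning one step's logits into a next-token distribution. The four-token vocabulary and the logit values are made up for illustration; a real model has tens of thousands of tokens and computes the logits with a neural network.

```python
import math

def softmax(logits):
    """Convert raw scores into a probability distribution."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical 4-token vocabulary and made-up logits for one decoding step.
vocab = ["11", "9", "6", "the"]
logits_t = [3.0, 1.0, 0.5, -1.0]

probs_t = softmax(logits_t)
next_token = vocab[probs_t.index(max(probs_t))]  # greedy choice: most likely token
```

Greedy selection is only one way to pick the next token; sampling from `probs_t` is another.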
In few-shot prompting, we condition the model on a context window containing examples:
$$P(y_{\text{test}} | \text{context}) = \prod_{t=1}^{n} P(y_t | y_1, \ldots, y_{t-1}, \text{context})$$
where $n$ is the number of tokens in the generated output and the context is structured as:
Standard Prompting (Question-Answer):
context = [example_1_question, example_1_answer,
example_2_question, example_2_answer,
...,
test_question]
Chain-of-Thought Prompting (Question-Reasoning-Answer):
context = [example_1_question, example_1_reasoning, example_1_answer,
example_2_question, example_2_reasoning, example_2_answer,
...,
test_question]
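The two context layouts above can be sketched as a small prompt builder. This is an illustrative sketch, not the paper's code: the `Q:` / `A:` / `The answer is` template, the `build_prompt` helper, and the exemplar itself are assumptions chosen for the example.

```python
def build_prompt(exemplars, test_question, use_cot):
    """Assemble a few-shot prompt; include reasoning only when use_cot is True."""
    parts = []
    for ex in exemplars:
        parts.append(f"Q: {ex['question']}")
        if use_cot:
            # Chain-of-thought: question, reasoning, then answer
            parts.append(f"A: {ex['reasoning']} The answer is {ex['answer']}.")
        else:
            # Standard: question, then answer only
            parts.append(f"A: The answer is {ex['answer']}.")
    parts.append(f"Q: {test_question}")
    parts.append("A:")  # the model continues from here
    return "\n".join(parts)

# Hypothetical exemplar written for this sketch.
exemplars = [{
    "question": "A shelf holds 4 boxes with 6 books each. How many books are on the shelf?",
    "reasoning": "There are 4 boxes and each has 6 books, so 4 * 6 = 24.",
    "answer": "24",
}]
test_question = "A bus has 3 rows of 5 seats. How many seats does it have?"

standard_prompt = build_prompt(exemplars, test_question, use_cot=False)
cot_prompt = build_prompt(exemplars, test_question, use_cot=True)
```

The only difference between the two prompts is the reasoning text inside each exemplar's answer; the test question is presented identically in both.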
The reasoning chain $r$ is a sequence of intermediate steps that explicitly decompose the problem:
$$r = [r_1, r_2, \ldots, r_m]$$
where each $r_i$ is one step of the chain (a token or short phrase) explaining a substep.
Why Does CoT Help? The Constraint Hypothesis
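Without CoT, the model must predict the final answer $y$ directly from the question $x$, modeling $P(y \mid x)$ in a single step. Nothing is learned here in the training sense: prompting does not update the weights.
With CoT, the prediction is routed through a reasoning chain $r$: $$P(y \mid x) = \sum_r P(r \mid x) \cdot P(y \mid r, x)$$ In practice a single chain is decoded rather than the full sum being computed.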
By decomposing through reasoning, the model:
- Generates reasoning that’s consistent with the question (high P(r | x))
- Generates an answer that’s consistent with the reasoning (high P(y | r, x))
This two-step decomposition breaks the problem of predicting a complex answer into two simpler subproblems.
Intuitively: it’s easier to first decide what steps to take, then what number follows from those steps, than to directly predict the number from the question.
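The sum over reasoning chains can be made concrete with a toy example. Every probability below is invented for illustration, and the two chain labels (`correct_plan`, `flawed_plan`) are hypothetical stand-ins for whole reasoning chains; a real model would have an enormous space of chains.

```python
def answer_distribution(p_r_given_x, p_y_given_rx):
    """Marginalize over reasoning chains: P(y|x) = sum_r P(r|x) * P(y|r,x)."""
    totals = {}
    for r, p_r in p_r_given_x.items():
        for y, p_y in p_y_given_rx[r].items():
            totals[y] = totals.get(y, 0.0) + p_r * p_y
    return totals

# Made-up probabilities for two candidate reasoning chains.
p_r_given_x = {"correct_plan": 0.7, "flawed_plan": 0.3}
p_y_given_rx = {
    "correct_plan": {"11": 0.95, "9": 0.05},
    "flawed_plan": {"11": 0.10, "9": 0.90},
}

dist = answer_distribution(p_r_given_x, p_y_given_rx)
# P("11" | x) = 0.7 * 0.95 + 0.3 * 0.10
```

In this toy setup the final answer is only as reliable as the chain that precedes it, which is why a high $P(r \mid x)$ for sound reasoning matters.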
Accuracy as a Function of Model Size
The paper tests models of increasing scale; for PaLM these are 8B, 62B, and 540B parameters.
Define:
- $A_{\text{standard}}(m)$ = accuracy with standard prompting on model size $m$
- $A_{\text{CoT}}(m)$ = accuracy with chain-of-thought prompting on model size $m$
- $\Delta A(m) = A_{\text{CoT}}(m) - A_{\text{standard}}(m)$ = improvement from CoT
Empirical Results (PaLM on the GSM8K benchmark):
| Model Size | Standard | CoT | Improvement (points) |
|---|---|---|---|
| 8B | 4.9% | 4.1% | -0.8 |
| 62B | 9.6% | 29.9% | +20.3 |
| 540B | 17.9% | 56.9% | +39.0 |
The key observation:
$$\Delta A(m) \gg \Delta A(m_0)$$
for large $m$ and small $m_0$.
In other words, as model size increases, the benefit of CoT accelerates. This is an emergent capability: the gap widens with scale.
Worked Example: Computing Accuracy
Suppose we run both standard and CoT prompting on a set of GSM8K-style problems (the counts below are illustrative):
- Total problems: 1000
- Problems solved correctly with CoT: 580
- Problems solved correctly with standard: 250
Accuracy with standard: $$A_{\text{standard}} = \frac{250}{1000} = 0.25 = 25\%$$
Accuracy with CoT: $$A_{\text{CoT}} = \frac{580}{1000} = 0.58 = 58\%$$
Improvement: $$\Delta A = 58\% - 25\% = 33 \text{ percentage points}$$
Relative improvement: $$\text{Relative Improvement} = \frac{\Delta A}{A_{\text{standard}}} = \frac{33}{25} = 1.32 = 132\%$$
This means CoT more than doubles the accuracy: a 132% relative improvement.
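The same arithmetic can be checked in a few lines of Python. This is a minimal sketch using the counts from the worked example; the `accuracy` helper is a name introduced here for illustration.

```python
def accuracy(correct, total):
    """Fraction of problems answered correctly."""
    if total <= 0:
        raise ValueError("total must be positive")
    return correct / total

total = 1000
a_standard = accuracy(250, total)  # standard prompting
a_cot = accuracy(580, total)       # chain-of-thought prompting

delta_points = (a_cot - a_standard) * 100      # absolute gain, in percentage points
relative = (a_cot - a_standard) / a_standard   # gain relative to the baseline

print(f"standard={a_standard:.0%}, CoT={a_cot:.0%}")
print(f"absolute gain={delta_points:.0f} points, relative gain={relative:.0%}")
```

Keeping the absolute gain (percentage points) separate from the relative gain (a ratio) avoids the common mix-up between "33 points better" and "132% better".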
Why Emergence Happens: Capacity Threshold
The paper doesn’t derive a mathematical formula for why emergence happens. But the intuition is:
Small models (< 100B):
- Have limited capacity to learn multi-step reasoning from examples
- Cannot reliably execute multi-step procedures
- CoT examples don’t change behavior much
Large models (> 100B):
- Have sufficient capacity to learn the structure of reasoning from examples
- Can chain together multi-step procedures
- CoT examples unlock this latent capability
This can be framed as a capacity threshold: there’s a minimum model capacity needed to benefit from chain-of-thought. Once you cross it, the benefit is dramatic.
Mathematically, this isn’t formalized in the paper, but think of it as:
$$\text{CoT Benefit} = \begin{cases} \text{small} & \text{if } \text{capacity} < \text{threshold} \\ \text{large} & \text{if } \text{capacity} > \text{threshold} \end{cases}$$
In reality, it’s probably a smooth function, but the paper reports that CoT gains only appear at roughly 100B parameters, with the exact point varying by model family and task.
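The difference between a hard threshold and a smooth transition can be sketched as follows. Nothing here is fitted to the paper's data: the 100B threshold, the `small` and `large` benefit levels, the `sharpness` parameter, and the logistic shape are all assumptions picked for illustration.

```python
import math

def step_benefit(params_b, threshold_b=100.0, small=0.01, large=0.30):
    """Piecewise model: benefit jumps from small to large at the threshold."""
    return large if params_b > threshold_b else small

def smooth_benefit(params_b, threshold_b=100.0, small=0.01, large=0.30, sharpness=4.0):
    """Logistic curve in log10(parameters); sharpness controls how abrupt the rise is."""
    z = sharpness * (math.log10(params_b) - math.log10(threshold_b))
    return small + (large - small) / (1.0 + math.exp(-z))

# Compare the two hypothetical curves at a few model sizes (in billions of parameters).
for size_b in [8, 62, 100, 540]:
    print(size_b, step_benefit(size_b), round(smooth_benefit(size_b), 3))
```

With a large `sharpness` the smooth curve approaches the step function, so a handful of measured model sizes cannot easily distinguish a true discontinuity from a steep but continuous rise.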
Benchmarks Tested
The paper evaluates on four benchmarks:
| Benchmark | Task Type | Difficulty |
|---|---|---|
| GSM8K | Grade-school word math | Easy-Medium |
| MATH | Competition math | Hard |
| StrategyQA | Commonsense + multi-hop reasoning | Medium |
| AQuA | Algebra word problems | Medium-Hard |
On all four, CoT shows improvements for large models, especially PaLM (540B):
- GSM8K: 18% → 57%
- MATH: 2% → 16% (bigger jump in relative terms)
- StrategyQA: 58% → 74%
- AQuA: 10% → 35%
The relative improvements are largest on tasks requiring multi-step reasoning — which makes sense. CoT is most valuable when you need to chain ideas together.
Why Not Just Scale?
A natural question: why not just train bigger models? Why add reasoning to prompts?
Answer: returns from scale diminish. The Scaling Laws paper (Paper 13) shows that loss improves only as a power law in model size $N$ with a small exponent, so each further gain requires multiplying the parameter count many times over. That’s prohibitively expensive.
CoT gives you a 2-3× improvement on hard reasoning tasks without touching the model’s weights, just by changing the prompt; the only added cost is generating more tokens per answer. It’s a prompt engineering technique with the return of a model training technique.
This is why CoT was so impactful: better performance without retraining.