The Math: Few-Shot Setup and Emergent Reasoning

This paper is fundamentally empirical rather than mathematical: it discovers a phenomenon through experiments, not by deriving new equations. Still, it helps to formalize the setup and pin down what is being measured.

Prerequisites: Cross-Entropy Loss

Few-Shot Prompting Formally

A language model is a function that predicts the next token given a sequence of previous tokens:

$$P(y_t | y_1, y_2, \ldots, y_{t-1}) = \text{softmax}(\text{logits}_t)$$
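As a minimal sketch of the softmax step, the snippet below turns a vector of logits into a next-token distribution and picks the most likely token. The four-token vocabulary and the logit values are made up for illustration; a real model produces one logit per vocabulary entry.

```python
import math

def softmax(logits):
    """Turn raw scores into a probability distribution over the vocabulary."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical 4-token vocabulary and made-up logits for one position t.
vocab = ["2", "4", "6", "8"]
logits_t = [1.0, 3.0, 0.5, 0.2]

probs_t = softmax(logits_t)                      # P(y_t | y_1, ..., y_{t-1})
next_token = vocab[probs_t.index(max(probs_t))]  # greedy choice of next token
```

Greedy selection is only one decoding strategy; sampling from `probs_t` is equally compatible with the formula.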

In few-shot prompting, we condition the model on a context window containing examples:

$$P(y_{\text{test}} | \text{context}) = \prod_{t=1}^{n} P(y_t | y_1, \ldots, y_{t-1}, \text{context})$$

where the context is structured as:

Standard Prompting (Question-Answer):

context = [example_1_question, example_1_answer,
           example_2_question, example_2_answer,
           ...,
           test_question]

Chain-of-Thought Prompting (Question-Reasoning-Answer):

context = [example_1_question, example_1_reasoning, example_1_answer,
           example_2_question, example_2_reasoning, example_2_answer,
           ...,
           test_question]
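The two context layouts can be sketched as a small prompt builder. The `Q:` / `A:` labels, the "The answer is" phrasing, the function name `build_prompt`, and the example problem are all assumptions chosen for illustration, not a required format.

```python
def build_prompt(examples, test_question, use_cot):
    """Concatenate few-shot exemplars, then the unanswered test question."""
    parts = []
    for ex in examples:
        parts.append(f"Q: {ex['question']}")
        if use_cot:
            # Chain-of-thought: the reasoning precedes the final answer.
            parts.append(f"A: {ex['reasoning']} The answer is {ex['answer']}.")
        else:
            # Standard prompting: the answer follows the question directly.
            parts.append(f"A: The answer is {ex['answer']}.")
    parts.append(f"Q: {test_question}")
    parts.append("A:")  # the model continues from here
    return "\n".join(parts)

# Hypothetical exemplar, made up for this sketch.
examples = [{
    "question": "Tom has 3 apples and buys 2 more. How many apples does he have?",
    "reasoning": "Tom starts with 3 apples. 3 + 2 = 5.",
    "answer": "5",
}]
test_question = "A box holds 4 pens. How many pens are in 3 boxes?"

standard_prompt = build_prompt(examples, test_question, use_cot=False)
cot_prompt = build_prompt(examples, test_question, use_cot=True)
```

The only difference between the two prompts is the reasoning text inside each exemplar; the test question is identical in both.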

The reasoning chain $r$ is a sequence of intermediate tokens that explicitly decompose the problem:

$$r = [r_1, r_2, \ldots, r_m]$$

where each $r_i$ is a token or phrase explaining a substep.

Why Does CoT Help? The Constraint Hypothesis

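Without CoT, the model must predict the answer directly from the question: $$P(y | x) = \text{distribution over final answers, predicted directly from the question}$$

With CoT, the answer is generated after a reasoning chain $r$, so the answer distribution factors as: $$P(y | x) = \sum_r P(r | x) \cdot P(y | r, x)$$

No weights are updated in either case; the difference comes entirely from the prompt. In practice the model does not evaluate the full sum: decoding follows a single chain $r$ and then produces the answer.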

By decomposing through reasoning, the model:

Generates reasoning that’s consistent with the question (high P(r | x))
Generates an answer that’s consistent with the reasoning (high P(y | r, x))

This two-step decomposition reduces the problem of predicting a complex answer into two simpler subproblems.

Intuitively: it’s easier to first decide what steps to take, then what number follows from those steps, than to directly predict the number from the question.
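To make the sum over reasoning chains concrete, here is a toy calculation with two candidate chains and two candidate answers. Every probability below is invented for illustration, and the chain names are hypothetical labels.

```python
# Toy distributions for one question x (all numbers are made up).
# p_r[r] = P(r | x): probability of each candidate reasoning chain.
p_r = {"add_then_double": 0.7, "double_then_add": 0.3}

# p_y_given_r[r][y] = P(y | r, x): answer distribution after each chain.
p_y_given_r = {
    "add_then_double": {"10": 0.9, "8": 0.1},
    "double_then_add": {"10": 0.2, "8": 0.8},
}

def marginal_answer_dist(p_r, p_y_given_r):
    """P(y | x) = sum over r of P(r | x) * P(y | r, x)."""
    p_y = {}
    for r, pr in p_r.items():
        for y, py in p_y_given_r[r].items():
            p_y[y] = p_y.get(y, 0.0) + pr * py
    return p_y

p_y = marginal_answer_dist(p_r, p_y_given_r)
# P("10" | x) = 0.7 * 0.9 + 0.3 * 0.2 = 0.69
# P("8"  | x) = 0.7 * 0.1 + 0.3 * 0.8 = 0.31
```

The answer "10" ends up likely because the more probable chain strongly supports it, which is the sense in which a good chain constrains the answer.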

Accuracy as a Function of Model Size

The paper tests several model families across scales. For PaLM, the sizes are 8B, 62B, and 540B parameters.

Define:

$A_{\text{standard}}(m)$ = accuracy with standard prompting on model size m
$A_{\text{CoT}}(m)$ = accuracy with chain-of-thought prompting on model size m
$\Delta A(m) = A_{\text{CoT}}(m) - A_{\text{standard}}(m)$ = improvement from CoT

Empirical Results (PaLM on the GSM8K benchmark):

Model Size	Standard	CoT	Improvement (points)
8B	4.9%	4.1%	-0.8
62B	9.6%	29.9%	+20.3
540B	17.9%	56.9%	+39.0

The key observation:

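$$\Delta A(m_{\text{large}}) \gg \Delta A(m_{\text{small}})$$

for the largest and smallest models tested.

In other words, the benefit of CoT grows with model size instead of staying constant. The paper describes this as an emergent ability of scale: the gap is negligible for small models and widens sharply for large ones.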
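The definition of $\Delta A(m)$ translates directly into code. The sketch below computes the gain per model size and checks whether it grows along an ordering of sizes. The accuracies are placeholder values chosen for illustration, not results from the paper, and the function names are hypothetical.

```python
def cot_gain(standard, cot):
    """Return Delta A(m) = A_CoT(m) - A_standard(m) for every model size m."""
    return {m: cot[m] - standard[m] for m in standard}

def gap_widens(gains, sizes):
    """True if Delta A strictly increases along the given ordering of sizes."""
    ordered = [gains[m] for m in sizes]
    return all(a < b for a, b in zip(ordered, ordered[1:]))

# Placeholder accuracies as fractions, NOT results from the paper.
sizes = ["small", "medium", "large"]
standard = {"small": 0.05, "medium": 0.10, "large": 0.20}
cot = {"small": 0.05, "medium": 0.25, "large": 0.55}

gains = cot_gain(standard, cot)
widening = gap_widens(gains, sizes)
```

A strictly increasing check is a deliberately simple criterion; with real measurements one would also want error bars before calling a trend emergent.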

Worked Example: Computing Accuracy

Suppose we run both prompting styles on a hypothetical set of 1000 GSM8K-style problems:

Total problems: 1000
Problems solved correctly with CoT: 580
Problems solved correctly with standard: 250

Accuracy with standard: $$A_{\text{standard}} = \frac{250}{1000} = 0.25 = 25\%$$

Accuracy with CoT: $$A_{\text{CoT}} = \frac{580}{1000} = 0.58 = 58\%$$

Improvement: $$\Delta A = 58\% - 25\% = 33 \text{ percentage points}$$

Relative improvement: $$\text{Relative Improvement} = \frac{\Delta A}{A_{\text{standard}}} = \frac{33}{25} = 1.32 = 132\%$$

This means CoT more than doubles the accuracy: a 132% relative improvement.
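The same arithmetic in code, as a small sketch using the counts from the worked example. The helper name `accuracy` is just an illustrative choice.

```python
def accuracy(num_correct, num_total):
    """Fraction of problems answered correctly."""
    return num_correct / num_total

total = 1000
a_standard = accuracy(250, total)
a_cot = accuracy(580, total)

# Absolute gain in percentage points, and relative gain as a fraction.
delta_points = (a_cot - a_standard) * 100
relative = (a_cot - a_standard) / a_standard

print(f"standard={a_standard:.0%}, CoT={a_cot:.0%}")
print(f"absolute gain={delta_points:.0f} points, relative gain={relative:.0%}")
```

Keeping the absolute gain (points) and the relative gain (a ratio) in separate variables avoids the common mistake of reporting one as the other.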

Why Emergence Happens: Capacity Threshold

The paper doesn’t derive a mathematical formula for why emergence happens. But the intuition is:

Small models (< 100B):

Have limited capacity to learn multi-step reasoning from examples
Cannot reliably execute multi-step procedures
CoT examples don’t change behavior much

Large models (> 100B):

Have sufficient capacity to learn the structure of reasoning from examples
Can chain together multi-step procedures
CoT examples unlock this latent capability

This is related to the capacity threshold hypothesis: there’s a minimum model capacity needed to benefit from chain-of-thought. Once you cross it, the benefit is dramatic.

Mathematically, this isn’t formalized in the paper, but think of it as:

$$\text{CoT Benefit} = \begin{cases} \text{small} & \text{if } \text{capacity} < \text{threshold} \\ \text{large} & \text{if } \text{capacity} > \text{threshold} \end{cases}$$

In reality, the transition is probably smooth, but the paper reports that CoT gains generally appear only at around 100B parameters.
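One way to picture the difference between a hard threshold and a smooth transition is to compare a step function with a logistic curve in log-parameter space. This is purely an illustrative sketch: the function names, the 100B threshold default, the benefit levels, and the `sharpness` value are all assumptions, and nothing here is fitted to data from the paper.

```python
import math

def step_benefit(params, threshold=100e9, small=0.01, large=0.30):
    """Piecewise version of the cases formula: small below, large above.

    The formula leaves capacity == threshold undefined; this sketch
    returns `small` in that case.
    """
    return large if params > threshold else small

def smooth_benefit(params, threshold=100e9, small=0.01, large=0.30,
                   sharpness=4.0):
    """Logistic curve in log10(params); nears the step as sharpness grows."""
    z = sharpness * (math.log10(params) - math.log10(threshold))
    return small + (large - small) / (1.0 + math.exp(-z))

# Compare the two at three hypothetical model sizes.
for n in (8e9, 62e9, 540e9):
    print(f"{n:.0e}: step={step_benefit(n):.2f}, smooth={smooth_benefit(n):.2f}")
```

With a large `sharpness` the smooth curve is hard to tell apart from the step when sampled at only a few model sizes, which is one reason a smooth underlying trend can look like a sharp transition.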

Benchmarks Tested

The paper evaluates on arithmetic, commonsense, and symbolic reasoning benchmarks. Three representative ones:

Benchmark	Task Type	Difficulty
GSM8K	Grade-school math word problems	Easy-Medium
StrategyQA	Commonsense + multi-hop reasoning	Medium
AQuA	Algebra word problems	Medium-Hard

On all three, CoT improves the largest model, PaLM (540B):

GSM8K: 17.9% → 56.9%
StrategyQA: 68.6% → 77.8%
AQuA: 25.2% → 35.8%

The relative improvements are largest on tasks requiring multi-step reasoning — which makes sense. CoT is most valuable when you need to chain ideas together.

Why Not Just Scale?

A natural question: why not just train bigger models? Why add reasoning to prompts?

Answer: Returns from scale diminish. The paper on Scaling Laws (Paper 13) shows that loss improves only as a power law in model size, so each further gain requires multiplying the parameter count by orders of magnitude. That’s prohibitively expensive.

CoT gives you a 2-3× improvement on hard reasoning tasks without touching the weights: you change the prompt, not the model, and pay only for the extra reasoning tokens at inference. It’s a prompt engineering technique with the return of a model training technique.

This is why CoT was so impactful: better performance without retraining.