Section 04

The Math: Few-Shot Setup and Emergent Reasoning

Chain-of-Thought Prompting Elicits Reasoning in Large Language Models 2022

This paper is fundamentally empirical rather than mathematical — it discovers a phenomenon through experiments, not through deriving new equations. However, we should formalize the setup and understand what’s being measured.

Few-Shot Prompting Formally

A language model is a function that predicts the next token given a sequence of previous tokens:

$$P(y_t | y_1, y_2, \ldots, y_{t-1}) = \text{softmax}(\text{logits}_t)$$
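As a minimal sketch of the softmax step, the snippet below turns a vector of logits into a next-token distribution and picks the most likely token. The four-token vocabulary and the logit values are made up for illustration; a real model produces one logit per vocabulary entry.

```python
import math

def softmax(logits):
    """Convert raw scores into a probability distribution."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical logits over a 4-token vocabulary at one time step t.
logits_t = [2.0, 1.0, 0.1, -1.0]
probs_t = softmax(logits_t)

# Greedy decoding picks the index with the highest probability.
next_token = max(range(len(probs_t)), key=probs_t.__getitem__)
print(probs_t, next_token)
```

Subtracting the maximum logit before exponentiating does not change the result, but it avoids overflow for large scores.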

In few-shot prompting, we condition the model on a context window containing examples:

$$P(y_{\text{test}} | \text{context}) = \prod_{t=1}^{n} P(y_t | y_1, \ldots, y_{t-1}, \text{context})$$

where the context is structured as:

Standard Prompting (Question-Answer):

context = [example_1_question, example_1_answer,
           example_2_question, example_2_answer,
           ...,
           test_question]

Chain-of-Thought Prompting (Question-Reasoning-Answer):

context = [example_1_question, example_1_reasoning, example_1_answer,
           example_2_question, example_2_reasoning, example_2_answer,
           ...,
           test_question]

The reasoning chain r is a sequence of intermediate tokens that explicitly decompose the problem:

$$r = [r_1, r_2, \ldots, r_m]$$

where each r_i is a token or phrase explaining a substep.
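The two context layouts can be sketched as a small prompt builder. This is an illustration under assumptions, not the paper's code: the `build_prompt` helper, the `Q:`/`A:` layout, the "The answer is" phrasing, and the exemplar text are all hypothetical.

```python
def build_prompt(exemplars, test_question, use_cot):
    """Concatenate few-shot exemplars and the test question into one prompt.

    Each exemplar is a dict with 'question', 'reasoning', and 'answer' keys.
    The 'Q:'/'A:' layout is an assumption of this sketch.
    """
    blocks = []
    for ex in exemplars:
        if use_cot:
            # Question-Reasoning-Answer: the reasoning precedes the answer.
            completion = f"{ex['reasoning']} The answer is {ex['answer']}."
        else:
            # Question-Answer: the answer follows the question directly.
            completion = f"The answer is {ex['answer']}."
        blocks.append(f"Q: {ex['question']}\nA: {completion}")
    # The test question goes last, with the answer slot left open.
    blocks.append(f"Q: {test_question}\nA:")
    return "\n\n".join(blocks)

# Hypothetical exemplar and test question.
exemplars = [{
    "question": "Tom has 3 apples and buys 2 more. How many apples does he have?",
    "reasoning": "Tom starts with 3 apples. 3 + 2 = 5.",
    "answer": "5",
}]
test_question = "A shelf holds 4 books and 6 more are added. How many books are on it?"

standard_prompt = build_prompt(exemplars, test_question, use_cot=False)
cot_prompt = build_prompt(exemplars, test_question, use_cot=True)
print(cot_prompt)
```

The only difference between the two prompts is whether the exemplar completions include the reasoning text. The model and the test question are identical in both cases.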

Why Does CoT Help? The Constraint Hypothesis

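Without CoT, the model must predict the answer in one step. Few-shot prompting updates no weights, so nothing is learned here; the prompt only changes what the model conditions on: $$P(y | x) = \text{distribution over final answers, predicted directly from the question}$$

With CoT, the answer distribution factors through the reasoning chain: $$P(y | x) = \sum_r P(r | x) \cdot P(y | r, x)$$ In practice, decoding follows a single greedy or sampled chain $r$ rather than summing over all of them.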

By decomposing through reasoning, the model:

  1. Generates reasoning that’s consistent with the question (high P(r | x))
  2. Generates an answer that’s consistent with the reasoning (high P(y | r, x))

This two-step decomposition reduces the problem of predicting a complex answer into two simpler subproblems.

Intuitively: it’s easier to first decide what steps to take, then what number follows from those steps, than to directly predict the number from the question.
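To make the decomposition concrete, here is a toy calculation of $P(y | x)$ as a sum over reasoning chains. The three chains and all probabilities are invented for illustration; they are not measurements from the paper.

```python
# Hypothetical distribution over three reasoning chains for one question x.
p_r_given_x = {
    "correct_steps": 0.6,
    "arithmetic_slip": 0.3,
    "misread_question": 0.1,
}

# Hypothetical probability of the right final answer y given each chain.
p_y_given_r_x = {
    "correct_steps": 0.95,
    "arithmetic_slip": 0.10,
    "misread_question": 0.05,
}

# Marginalize over chains: P(y | x) = sum_r P(r | x) * P(y | r, x)
p_y_given_x = sum(p_r_given_x[r] * p_y_given_r_x[r] for r in p_r_given_x)
print(f"P(correct answer | x) = {p_y_given_x:.3f}")
```

In this toy setup almost all of the probability of a correct answer comes from the chain with correct steps, which is the intuition behind the decomposition: a good chain makes the final answer nearly determined.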

Accuracy as a Function of Model Size

The paper tests on models of increasing scale: 8B, 62B, 300B, 540B parameters.

Define:

  • $A_{\text{standard}}(m)$ = accuracy with standard prompting on model size m
  • $A_{\text{CoT}}(m)$ = accuracy with chain-of-thought prompting on model size m
  • $\Delta A(m) = A_{\text{CoT}}(m) - A_{\text{standard}}(m)$ = improvement from CoT

Empirical Results (on GSM8K benchmark):

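| Model Size | Standard | CoT | Improvement (points) |
|------------|----------|-----|----------------------|
| 8B         | 5%       | 6%  | +1                   |
| 62B        | 13%      | 15% | +2                   |
| 300B       | 25%      | 35% | +10                  |
| 540B       | 25%      | 58% | +33                  |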

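The key observation is that the gain at large scale dwarfs the gain at small scale:

$$\Delta A(m_{\text{large}}) \gg \Delta A(m_{\text{small}})$$

for example, $m_{\text{large}} = 540\text{B}$ versus $m_{\text{small}} = 8\text{B}$. Only a finite set of model sizes is measured, so this is a comparison between measured points, not a limit.

In other words, as model size increases, the benefit of CoT accelerates. This is an emergent capability: the gap widens with scale.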
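The widening gap can be checked directly from the table. The sketch below recomputes $\Delta A(m)$ for each row and tests whether the gain grows with every step up in size; the numbers are copied from the table as listed, and the variable names are just for illustration.

```python
# GSM8K accuracies (%) as listed in the table: size -> (standard, CoT).
results = {
    "8B": (5, 6),
    "62B": (13, 15),
    "300B": (25, 35),
    "540B": (25, 58),
}

# Delta A(m) = A_CoT(m) - A_standard(m), in percentage points.
delta = {size: cot - std for size, (std, cot) in results.items()}

# Dicts keep insertion order, so gains run from smallest to largest model.
gains = list(delta.values())
gap_widens = all(a < b for a, b in zip(gains, gains[1:]))

print(delta)
print("gap widens with scale:", gap_widens)
```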

Worked Example: Computing Accuracy

Let’s say we run CoT prompting on the GSM8K dataset:

  • Total problems: 1000
  • Problems solved correctly with CoT: 580
  • Problems solved correctly with standard: 250

Accuracy with standard: $$A_{\text{standard}} = \frac{250}{1000} = 0.25 = 25\%$$

Accuracy with CoT: $$A_{\text{CoT}} = \frac{580}{1000} = 0.58 = 58\%$$

Improvement: $$\Delta A = 58\% - 25\% = 33 \text{ percentage points}$$

Relative improvement: $$\text{Relative Improvement} = \frac{\Delta A}{A_{\text{standard}}} = \frac{33}{25} = 1.32 = 132\%$$

This means CoT more than doubles the accuracy: 58% is 2.32× the 25% baseline, which is a 132% relative improvement.
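The same arithmetic as a short script, using the counts from the worked example. The `accuracy` helper is a hypothetical name for this sketch; rounding is applied only when printing, since the subtraction is done in floating point.

```python
def accuracy(num_correct, num_total):
    """Fraction of problems answered correctly."""
    return num_correct / num_total

total = 1000
a_standard = accuracy(250, total)  # standard prompting
a_cot = accuracy(580, total)       # chain-of-thought prompting

# Absolute improvement in percentage points.
delta_points = (a_cot - a_standard) * 100

# Relative improvement: the gain as a fraction of the baseline accuracy.
relative = (a_cot - a_standard) / a_standard

print(f"standard: {a_standard:.0%}, CoT: {a_cot:.0%}")
print(f"improvement: {delta_points:.0f} points, relative: {relative:.0%}")
```

Keeping the absolute gain (in points) and the relative gain (as a ratio to the baseline) as separate quantities avoids the common mistake of reporting both with a bare "%".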

Why Emergence Happens: Capacity Threshold

The paper doesn’t derive a mathematical formula for why emergence happens. But the intuition is:

Small models (< 100B):

  • Have limited capacity to learn multi-step reasoning from examples
  • Cannot reliably execute multi-step procedures
  • CoT examples don’t change behavior much

Large models (> 100B):

  • Have sufficient capacity to learn the structure of reasoning from examples
  • Can chain together multi-step procedures
  • CoT examples unlock this latent capability

This is related to the capacity threshold hypothesis: there’s a minimum model capacity needed to benefit from chain-of-thought. Once you cross it, the benefit is dramatic.

Mathematically, this isn’t formalized in the paper, but think of it as:

$$\text{CoT Benefit} = \begin{cases} \text{small} & \text{if } \text{capacity} < \text{threshold} \\ \text{large} & \text{if } \text{capacity} > \text{threshold} \end{cases}$$

In reality, it’s probably a smooth function, but the paper demonstrates a sharp transition around 100B parameters.
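One way to picture the difference between a hard threshold and a smooth transition is to compare a step function with a logistic curve in log-parameter space. Everything here is an assumption made for illustration: the logistic shape, the `sharpness` value, and placing the midpoint at exactly 100B are not taken from the paper.

```python
import math

THRESHOLD = 100e9  # assumed transition point, in parameters

def step_benefit(params):
    """Hard-threshold picture: small below the threshold, large above it."""
    return "small" if params < THRESHOLD else "large"

def smooth_benefit(params, max_gain=1.0, sharpness=4.0):
    """Smooth picture: logistic curve in log10(params).

    Returns a value between 0 and max_gain. The shape and the sharpness
    value are hypothetical, chosen only to illustrate a rapid transition.
    """
    x = math.log10(params) - math.log10(THRESHOLD)
    return max_gain / (1.0 + math.exp(-sharpness * x))

for size in (8e9, 62e9, 100e9, 540e9):
    print(f"{size / 1e9:>5.0f}B  step={step_benefit(size):<5}  "
          f"smooth={smooth_benefit(size):.2f}")
```

With a large enough `sharpness`, the smooth curve becomes hard to tell apart from the step function when it is sampled at only a handful of model sizes, which is one reason a smooth underlying trend can look like a sharp transition.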

Benchmarks Tested

The paper evaluates on four benchmarks:

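| Benchmark  | Task Type                         | Difficulty  |
|------------|-----------------------------------|-------------|
| GSM8K      | Grade-school word math            | Easy-Medium |
| MATH       | Competition math                  | Hard        |
| StrategyQA | Commonsense + multi-hop reasoning | Medium      |
| AQuA       | Algebra word problems             | Medium-Hard |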

On all four, CoT shows improvements for large models, especially PaLM (540B):

  • GSM8K: 17% → 58%
  • MATH: 2% → 16% (bigger jump in relative terms)
  • StrategyQA: 58% → 74%
  • AQuA: 10% → 35%

The relative improvements are largest on tasks requiring multi-step reasoning — which makes sense. CoT is most valuable when you need to chain ideas together.
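To compare the benchmarks on equal footing, the sketch below computes both the absolute gain and the CoT-to-standard ratio for each pair of figures in the list above. The numbers are copied as listed; the variable names are only for illustration.

```python
# Accuracies (%) from the list above: benchmark -> (standard, CoT).
scores = {
    "GSM8K": (17, 58),
    "MATH": (2, 16),
    "StrategyQA": (58, 74),
    "AQuA": (10, 35),
}

summary = {}
for name, (std, cot) in scores.items():
    summary[name] = {
        "absolute": cot - std,  # gain in percentage points
        "ratio": cot / std,     # CoT accuracy as a multiple of standard
    }
    print(f"{name:<11} +{cot - std} points, {cot / std:.2f}x")

largest_absolute = max(summary, key=lambda n: summary[n]["absolute"])
largest_ratio = max(summary, key=lambda n: summary[n]["ratio"])
print("largest absolute gain:", largest_absolute)
print("largest relative gain:", largest_ratio)
```

The two rankings differ: a benchmark with a very low baseline can show the largest ratio while another shows the largest gain in points, so it is worth stating which measure a comparison uses.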

Why Not Just Scale?

A natural question: why not just train bigger models? Why add reasoning to prompts?

Answer: Returns to scale diminish. The paper on Scaling Laws (Paper 13) shows that test loss improves only as a power law in model size, so each further gain of the same size requires multiplying the parameter count many times over. That’s prohibitively expensive.

CoT gives you a 2-3× improvement at no training cost: only the prompt changes, not the model. It’s a prompt engineering technique with the return of a model training technique.

This is why CoT was so impactful: better performance without retraining.