Chain-of-Thought Prompting Elicits Reasoning in Large Language Models

Wei, Wang, Schuurmans, Bosma, Ichter, Xia, Chi, Le, Zhou · NeurIPS 2022

What This Paper Did

For years, large language models handled simple problems but struggled on multi-step reasoning tasks: arithmetic word problems, commonsense inference, strategy questions. Given a grade-school word problem, GPT-3 with ordinary few-shot prompting frequently produced a wrong answer. The paper’s diagnosis is that standard prompting asks the model to jump straight from question to answer, leaving no room to work through the intermediate steps the problem requires.

The fix is startlingly simple: show the model how to reason.

Instead of showing few-shot examples as (question, answer) pairs:

Q: 5 + 7 = ?
A: 12

Q: 12 + 8 = ?
A: 20

Show them as (question, step-by-step chain of thought, answer) triples:

Q: 5 + 7 = ?
A: First I add 5 + 7 = 12
   So the answer is 12

Q: 12 + 8 = ?
A: First I add 12 + 8 = 20
   So the answer is 20

When a large language model (roughly 100B+ parameters) sees these examples with intermediate reasoning steps, it imitates the format and generates its own reasoning chain before giving the answer. No weights are updated; the change comes entirely from the prompt. The result is dramatic:

Standard prompting on PaLM 540B: about 18% accuracy on GSM8K (grade-school math)
Chain-of-thought prompting on the same model: about 57% accuracy, more than 3× better
Critical insight: The effect is close to zero, and sometimes negative, on smaller models (below ~100B parameters), but large on the biggest ones. In the paper’s framing, chain-of-thought reasoning is an emergent ability of model scale.
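Accuracy figures like these depend on pulling a final answer out of free-form generated text. The sketch below shows one way that scoring could work, assuming every completion ends with a phrase like "the answer is N". The function names, the regular expression, and the two sample generations are all hypothetical, not taken from the paper.

```python
import re
from typing import List, Optional


def extract_answer(generation: str) -> Optional[str]:
    """Return the last number that follows 'the answer is', or None if absent."""
    matches = re.findall(
        r"the answer is\s*\$?(-?[\d,]*\.?\d+)", generation, flags=re.IGNORECASE
    )
    if not matches:
        return None
    # Drop thousands separators so "1,200" compares equal to "1200".
    return matches[-1].replace(",", "")


def accuracy(generations: List[str], gold: List[str]) -> float:
    """Fraction of generations whose extracted answer exactly matches the gold string."""
    correct = sum(extract_answer(g) == y for g, y in zip(generations, gold))
    return correct / len(gold)


# Hypothetical model outputs: the second one contains an arithmetic slip.
generations = [
    "First I add 5 + 7 = 12. So the answer is 12.",
    "First I add 12 + 8 = 21. So the answer is 21.",
]
gold = ["12", "20"]
print(accuracy(generations, gold))  # 0.5
```

Comparing strings exactly is a simplification; a real evaluation would likely normalise numeric formats (for example "12.0" versus "12") before comparing.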

The paper evaluates arithmetic, commonsense, and symbolic reasoning tasks. Representative benchmarks include:

GSM8K: Grade-school math word problems
SVAMP: Math word problems with varied structure
AQuA: Algebra word problems (multiple choice)
StrategyQA: Multi-hop commonsense reasoning

Across these, the gains from chain-of-thought prompting show up mainly in the largest models; the same prompt does little for small ones.

Key equations:

Standard few-shot setup:

Prompt: {(x₁, y₁), (x₂, y₂), ..., (xₖ, yₖ), x_test}
Model generates: y_test

Chain-of-thought few-shot setup:

Prompt: {(x₁, r₁, y₁), (x₂, r₂, y₂), ..., (xₖ, rₖ, yₖ), x_test}
Model generates: r_test, then y_test

Where r is the reasoning chain — the intermediate steps shown to the model.
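The two setups differ only in how each exemplar is rendered into text. Here is a minimal sketch of that rendering. The `build_prompt` helper, the `Q:`/`A:` layout, and the "The answer is" suffix are illustrative assumptions rather than the paper's exact template.

```python
from typing import List, Tuple

# Each exemplar is (x, r, y): question, reasoning chain, final answer.
Exemplar = Tuple[str, str, str]


def build_prompt(exemplars: List[Exemplar], test_question: str, use_cot: bool) -> str:
    """Render k exemplars followed by the test question, leaving the answer open."""
    blocks = []
    for question, reasoning, answer in exemplars:
        if use_cot:
            # Chain-of-thought: the reasoning r_i precedes the answer y_i.
            blocks.append(f"Q: {question}\nA: {reasoning} The answer is {answer}.")
        else:
            # Standard: the reasoning is dropped and only y_i is shown.
            blocks.append(f"Q: {question}\nA: The answer is {answer}.")
    # The model continues from here: r_test then y_test (CoT), or y_test alone.
    blocks.append(f"Q: {test_question}\nA:")
    return "\n\n".join(blocks)


exemplars = [
    ("5 + 7 = ?", "First I add 5 + 7 = 12.", "12"),
    ("12 + 8 = ?", "First I add 12 + 8 = 20.", "20"),
]
standard_prompt = build_prompt(exemplars, "9 + 6 = ?", use_cot=False)
cot_prompt = build_prompt(exemplars, "9 + 6 = ?", use_cot=True)
print(cot_prompt)
```

Both prompts end with an open `A:`, so the only thing that changes between conditions is whether the exemplars demonstrate reasoning before the answer.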

The Indian Analogy

Imagine a student writing a Class 10 maths exam. The student knows the method for every problem, but rushes through:

Q: A train leaves Delhi at 60 km/h. A second train leaves 2 hours later on the same route at 80 km/h.
   When do they meet?
A: 4 hours

The student wrote the answer in 10 seconds: no steps, just intuition. The answer is wrong, and with no working on the page there is nothing to check it against.

Now the same student slows down and writes each step:

Q: A train leaves Delhi at 60 km/h. A second train leaves 2 hours later on the same route at 80 km/h.
   When do they meet?

A: The first train travels for 2 hours before the second starts.
   Distance covered: 60 × 2 = 120 km
   When the second train starts, they're 120 km apart.
   Their relative speed: 80 - 60 = 20 km/h
   Time to close the gap: 120 / 20 = 6 hours
   So they meet 6 hours after the second train starts.

By writing out each step, the student catches logical gaps, verifies arithmetic, and arrives at the right answer. They can even spot if they made a mistake: “Wait, if the relative speed is 20 km/h, they should meet in 6 hours, not 4.”
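The student's written steps map directly onto a few lines of arithmetic. The sketch below assumes both trains travel in the same direction along the same route; `catch_up_time` is a hypothetical helper name introduced for illustration.

```python
def catch_up_time(lead_speed: float, chase_speed: float, head_start_hours: float) -> float:
    """Hours after the second train departs until it catches the first."""
    if chase_speed <= lead_speed:
        raise ValueError("the second train never catches up")
    gap_km = lead_speed * head_start_hours    # 60 * 2 = 120 km head start
    closing_speed = chase_speed - lead_speed  # 80 - 60 = 20 km/h
    return gap_km / closing_speed             # 120 / 20 = 6 hours


hours = catch_up_time(lead_speed=60, chase_speed=80, head_start_hours=2)
print(hours)  # 6.0

# Cross-check: both trains are the same distance from Delhi at that moment.
first_train_km = 60 * (2 + hours)
second_train_km = 80 * hours
assert first_train_km == second_train_km == 480
```

The cross-check at the end plays the same role as the student's "Wait..." moment: an intermediate quantity is available to verify, which a bare final answer would not provide.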

Chain-of-thought prompting asks language models for the same discipline: show your work. When the model generates intermediate reasoning steps, it can break a problem into smaller pieces and solve problems it would have bungled by jumping straight to the answer. The written steps also let a reader see where a wrong answer went off track.

Read in This Order

Section	What You Will Learn	Difficulty	Time
01 — Context	Historical moment: why LLMs were failing at reasoning, despite being “intelligent”	🟢 beginner	8 min
02 — The Problem	Concrete examples of standard prompting failures	🟢 beginner	7 min
03 — The Idea	What CoT is, why it works, the intuition	🟡 intermediate	10 min
04 — The Math	Formal setup, few-shot notation, emergent-capability hypothesis	🟡 intermediate	8 min
05 — Worked Example	Step-by-step trace on a real GSM8K problem	🟢 beginner	6 min
06 — The Code	How to structure and run a CoT prompt	🟡 intermediate	8 min
07 — Limitations	When CoT fails, why, and what we don’t know	🟡 intermediate	6 min
08 — Impact	What changed: follow-ups, cascading innovations	🟡 intermediate	5 min
09 — Summary	One-sentence recap, glossary, further reading	🟢 beginner	3 min

Before You Read: Math You Need

No heavy mathematics in this paper — it’s about empirical discovery. But you should know:

Cross-entropy loss: How we measure if a model is predicting the right token.
Few-shot learning: How we show examples in the prompt to teach the model. Covered in Paper 12 (GPT-3).

Architecture: Standard vs. Chain-of-Thought Prompting

STANDARD PROMPTING:
    Input Prompt
         |
    [Example 1: Q → A]
    [Example 2: Q → A]
         |
    [Test Question]
         |
    Language Model (decode)
         |
    Output: Answer (direct, often wrong)


CHAIN-OF-THOUGHT PROMPTING:
    Input Prompt
         |
    [Example 1: Q → Reasoning Chain → A]
    [Example 2: Q → Reasoning Chain → A]
         |
    [Test Question]
         |
    Language Model (decode reasoning)
         |
    Generate: Intermediate Steps
         |
    Language Model (continue decoding)
         |
    Output: Answer (after showing work)


KEY INSIGHT:
- Model size matters: CoT helps at ~100B+ parameters, barely helps (or even hurts) below that
- This reveals "emergent reasoning" — a capability that only manifests at scale

