Chain-of-Thought Prompting Elicits Reasoning in Large Language Models
Wei, Wang, Schuurmans, Bosma, Ichter, Xia, Chi, Le, Zhou · NeurIPS 2022
What This Paper Did
Before this paper, large language models handled simple problems well but did poorly on multi-step reasoning tasks: arithmetic, commonsense logic, strategy questions. Given a grade-school word problem and a prompt of plain (question, answer) examples, even a very large model answered correctly only a small fraction of the time. The paper's diagnosis is that standard prompting asks the model to jump straight to the answer, leaving it no room to work through the intermediate steps.
The fix is startlingly simple: show the model how to reason.
Instead of showing few-shot examples as (question, answer) pairs:
Q: 5 + 7 = ?
A: 12
Q: 12 + 8 = ?
A: 20
Show them as (question, step-by-step chain of thought, answer) triples:
Q: 5 + 7 = ?
A: First I add 5 + 7 = 12
So the answer is 12
Q: 12 + 8 = ?
A: First I add 12 + 8 = 20
So the answer is 20
When a large language model (100B+ parameters) sees these examples with intermediate reasoning steps, it imitates the pattern and generates its own reasoning chain before giving the answer. Nothing is retrained; the only change is the prompt. The result is dramatic:
- Standard prompting on PaLM 540B: about 18% accuracy on GSM8K (grade-school math)
- Chain-of-thought prompting on the same model: about 57% accuracy, more than 3× better
- Critical insight: The effect is almost zero on smaller models (below ~100B parameters), but massive on large ones. Reasoning emerges only at scale.
The paper evaluates arithmetic, commonsense, and symbolic reasoning benchmarks, including:
- GSM8K: Grade-school math word problems
- SVAMP: Math word problems with varied structure
- StrategyQA: Multi-hop commonsense reasoning
- AQuA: Algebra word problems
On these benchmarks, chain-of-thought prompting improves the largest models, often substantially, while small models see little or no benefit.
Key equations:
Standard few-shot setup:
Prompt: {(x₁, y₁), (x₂, y₂), ..., (xₖ, yₖ), x_test}
Model generates: y_test
Chain-of-thought few-shot setup:
Prompt: {(x₁, r₁, y₁), (x₂, r₂, y₂), ..., (xₖ, rₖ, yₖ), x_test}
Model generates: r_test, then y_test
Where r is the reasoning chain — the intermediate steps shown to the model.
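The two setups above amount to plain string assembly, which can be sketched in a few lines of Python. This is a minimal illustration, not the paper's code: the `Exemplar` class, the two helper functions, and the `Q:` / `A:` layout with the phrase "So the answer is" are assumptions chosen to mirror the toy examples earlier on this page.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Exemplar:
    """One few-shot example (hypothetical container, not from the paper)."""
    question: str   # x_i
    rationale: str  # r_i, the worked reasoning steps
    answer: str     # y_i


def standard_prompt(exemplars: List[Exemplar], test_question: str) -> str:
    """Standard few-shot prompt: (x_i, y_i) pairs followed by x_test."""
    blocks = [f"Q: {ex.question}\nA: {ex.answer}" for ex in exemplars]
    blocks.append(f"Q: {test_question}\nA:")
    return "\n\n".join(blocks)


def cot_prompt(exemplars: List[Exemplar], test_question: str) -> str:
    """Chain-of-thought prompt: (x_i, r_i, y_i) triples followed by x_test."""
    blocks = [
        f"Q: {ex.question}\nA: {ex.rationale} So the answer is {ex.answer}."
        for ex in exemplars
    ]
    # The trailing "A:" is left open so the model writes r_test, then y_test.
    blocks.append(f"Q: {test_question}\nA:")
    return "\n\n".join(blocks)


exemplars = [
    Exemplar("5 + 7 = ?", "First I add 5 + 7 = 12.", "12"),
    Exemplar("12 + 8 = ?", "First I add 12 + 8 = 20.", "20"),
]

print(cot_prompt(exemplars, "9 + 6 = ?"))
```

The only difference between the two functions is whether the rationale is included in each exemplar; the test question and the model stay the same.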
The Indian Analogy
Imagine a student writing a Class 10 maths exam. The student knows the method for every problem, but rushes through:
Q: A train leaves Delhi at 60 km/h. A second train leaves 2 hours later at 80 km/h.
When do they meet?
A: 4 hours
The student wrote the answer in 10 seconds: no steps, just intuition. The answer is wrong, and because the intermediate steps were skipped, the student had no chance to catch the mistake.
Now the same student slows down and writes each step:
Q: A train leaves Delhi at 60 km/h. A second train leaves 2 hours later at 80 km/h.
When do they meet?
A: The first train travels for 2 hours before the second starts.
Distance covered: 60 × 2 = 120 km
When the second train starts, they're 120 km apart.
Their relative speed: 80 - 60 = 20 km/h
Time to close the gap: 120 / 20 = 6 hours
So they meet 6 hours after the second train starts.
By writing out each step, the student catches logical gaps, verifies arithmetic, and arrives at the right answer. They can even spot if they made a mistake: “Wait, if the relative speed is 20 km/h, they should meet in 6 hours, not 4.”
Chain-of-thought prompting teaches language models the same discipline: show your work. When the model generates intermediate reasoning steps, it can catch its own errors and solve problems it would have bungled by jumping straight to the answer.
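The student's written steps for the train problem translate directly into code, one line per step. The function name `catch_up_time` and its parameters are made up for this sketch; the numbers are the ones from the problem above.

```python
def catch_up_time(lead_speed: float, chase_speed: float, head_start_hours: float) -> float:
    """Hours after the second train departs until it draws level with the first."""
    if chase_speed <= lead_speed:
        raise ValueError("the second train never catches up")
    gap_km = lead_speed * head_start_hours    # distance covered during the head start
    closing_speed = chase_speed - lead_speed  # relative speed in km/h
    return gap_km / closing_speed             # time to close the gap


hours = catch_up_time(lead_speed=60, chase_speed=80, head_start_hours=2)
print(hours)  # 6.0

# Sanity check: both trains are the same distance from Delhi at that moment.
assert 60 * (2 + hours) == 80 * hours
```

Each intermediate variable corresponds to one line of the student's working, which is what makes a wrong step easy to locate.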
Read in This Order
| Section | What You Will Learn | Difficulty | Time |
|---|---|---|---|
| 01 — Context | Historical moment: why LLMs were failing at reasoning, despite being “intelligent” | 🟢 beginner | 8 min |
| 02 — The Problem | Concrete examples of standard prompting failures | 🟢 beginner | 7 min |
| 03 — The Idea | What CoT is, why it works, the intuition | 🟡 intermediate | 10 min |
| 04 — The Math | Formal setup, few-shot notation, emergent-capability hypothesis | 🟡 intermediate | 8 min |
| 05 — Worked Example | Step-by-step trace on a real GSM8K problem | 🟢 beginner | 6 min |
| 06 — The Code | How to structure and run a CoT prompt | 🟡 intermediate | 8 min |
| 07 — Limitations | When CoT fails, why, and what we don’t know | 🟡 intermediate | 6 min |
| 08 — Impact | What changed: follow-ups, cascading innovations | 🟡 intermediate | 5 min |
| 09 — Summary | One-sentence recap, glossary, further reading | 🟢 beginner | 3 min |
Before You Read: Math You Need
No heavy mathematics in this paper — it’s about empirical discovery. But you should know:
- Cross-entropy loss: How we measure if a model is predicting the right token.
- Few-shot learning: How we show examples in the prompt to teach the model. Covered in Paper 12 (GPT-3).
Architecture: Standard vs. Chain-of-Thought Prompting
STANDARD PROMPTING:
Input Prompt
|
[Example 1: Q → A]
[Example 2: Q → A]
|
[Test Question]
|
Language Model (decode)
|
Output: Answer (direct, often wrong)
CHAIN-OF-THOUGHT PROMPTING:
Input Prompt
|
[Example 1: Q → Reasoning Chain → A]
[Example 2: Q → Reasoning Chain → A]
|
[Test Question]
|
Language Model (decode reasoning)
|
Generate: Intermediate Steps
|
Language Model (continue decoding)
|
Output: Answer (after showing work)
KEY INSIGHT:
- Model size matters: CoT helps at ~100B+ parameters, barely helps (and can hurt) below that
- This reveals "emergent reasoning" — a capability that only manifests at scale
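The last stage of the chain-of-thought pipeline above, reading the answer off the end of the generated reasoning, can be sketched as a small parser. This assumes completions end with the phrase "the answer is ..." as in the toy examples on this page; the regular expression and the function name `extract_answer` are illustrative choices, not the paper's evaluation script.

```python
import re
from typing import Optional

# Matches a number following "answer is", allowing an optional "$",
# a leading minus sign, thousands separators, and a decimal part.
ANSWER_RE = re.compile(r"answer is\s*\$?(-?\d[\d,]*(?:\.\d+)?)", re.IGNORECASE)


def extract_answer(completion: str) -> Optional[str]:
    """Return the last numeric answer stated in a completion, or None."""
    matches = ANSWER_RE.findall(completion)
    if not matches:
        return None
    # Use the last match: earlier ones may belong to intermediate steps.
    return matches[-1].replace(",", "")


completion = "First I add 12 + 8 = 20\nSo the answer is 20"
print(extract_answer(completion))  # 20
```

Returning `None` when no answer phrase is found keeps malformed completions distinguishable from wrong answers when scoring.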