Paper 14
Intermediate

Chain-of-Thought Prompting Elicits Reasoning in Large Language Models

Wei, Wang, Schuurmans, Bosma, Ichter, Xia, Chi, Le, Zhou · NeurIPS 2022
Read on arXiv


What This Paper Did

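For years, large language models handled simple questions well but struggled on multi-step reasoning tasks: arithmetic word problems, commonsense inference, strategy questions. Given a grade-school word problem and a standard few-shot prompt, even very large models often produced a fluent but wrong answer. The paper locates much of the problem in the prompt rather than the model: standard prompting asks for the answer in a single jump, with no room to work through intermediate steps. Whether the model is "really" reasoning is a question the authors leave open.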

The fix is startlingly simple: show the model how to reason.

Instead of showing few-shot examples as (question, answer) pairs:

Q: 5 + 7 = ?
A: 12

Q: 12 + 8 = ?
A: 20

Show them as (question, step-by-step chain of thought, answer) triples:

Q: 5 + 7 = ?
A: First I add 5 + 7 = 12
   So the answer is 12

Q: 12 + 8 = ?
A: First I add 12 + 8 = 20
   So the answer is 20

When a sufficiently large language model (roughly 100B+ parameters) sees these examples with intermediate reasoning steps, it follows the pattern and generates its own reasoning chain before giving the answer. No weights are updated; the only change is in the prompt. The result is dramatic:

  • Standard prompting on PaLM 540B: about 18% accuracy on GSM8K (grade-school math word problems)
  • Chain-of-thought prompting on the same model: about 57% accuracy, more than 3× better
  • Critical insight: The effect is close to zero, and sometimes negative, on smaller models (below ~100B parameters), but large on the biggest ones. In the paper's framing, chain-of-thought reasoning is an emergent ability of model scale.

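The paper tests this on three families of benchmarks:

  1. Arithmetic reasoning: GSM8K, SVAMP, ASDiv, AQuA, and MAWPS (math word problems)
  2. Commonsense reasoning: CSQA, StrategyQA, Date Understanding, Sports Understanding, and SayCan
  3. Symbolic reasoning: last-letter concatenation and coin flip

The gains are largest on the harder multi-step problems and appear mainly at the largest model sizes. Chain-of-thought prompting changes the prompt, not the model: the same large model that fails with standard prompting does far better when shown worked reasoning.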

Key equations:

Standard few-shot setup:

Prompt: {(x₁, y₁), (x₂, y₂), ..., (xₖ, yₖ), x_test}
Model generates: y_test

Chain-of-thought few-shot setup:

Prompt: {(x₁, r₁, y₁), (x₂, r₂, y₂), ..., (xₖ, rₖ, yₖ), x_test}
Model generates: r_test, then y_test

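Here rᵢ is the hand-written reasoning chain for example i (the intermediate steps shown to the model), and r_test is the chain the model generates on its own before producing y_test.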
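The two setups above differ only in how each exemplar is rendered into text. The sketch below is one way to assemble both prompts in Python; the names `Exemplar`, `format_exemplar`, and `build_prompt` are hypothetical helpers for illustration, and the "So the answer is ..." wording simply mirrors the toy examples earlier on this page rather than the paper's exact exemplars.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Exemplar:
    question: str                    # x_i
    answer: str                      # y_i
    reasoning: Optional[str] = None  # r_i; None means a standard (x, y) exemplar


def format_exemplar(ex: Exemplar) -> str:
    """Render one exemplar as 'Q: ... / A: ...' text."""
    if ex.reasoning is None:
        return f"Q: {ex.question}\nA: {ex.answer}"
    # Chain-of-thought: the reasoning comes first, then the answer sentence.
    return f"Q: {ex.question}\nA: {ex.reasoning} So the answer is {ex.answer}."


def build_prompt(exemplars: List[Exemplar], test_question: str) -> str:
    """Concatenate k exemplars and leave the final 'A:' open for the model."""
    blocks = [format_exemplar(ex) for ex in exemplars]
    blocks.append(f"Q: {test_question}\nA:")
    return "\n\n".join(blocks)


standard = [Exemplar("5 + 7 = ?", "12"), Exemplar("12 + 8 = ?", "20")]
cot = [
    Exemplar("5 + 7 = ?", "12", "First I add 5 + 7 = 12."),
    Exemplar("12 + 8 = ?", "20", "First I add 12 + 8 = 20."),
]

standard_prompt = build_prompt(standard, "9 + 6 = ?")
cot_prompt = build_prompt(cot, "9 + 6 = ?")
print(cot_prompt)
```

Both prompts end with an open "A:", so the model's continuation is y_test in the standard case and r_test followed by y_test in the chain-of-thought case. Sending the prompt to a model is left out on purpose, since that depends on whichever API is in use.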


The Indian Analogy

Imagine a student writing a Class 10 maths exam. The student knows the method for every problem, but rushes through:

Q: A train leaves Delhi at 60 km/h. A second train leaves 2 hours later at 80 km/h
   on the same route. How long after the second train leaves does it catch up?
A: 4 hours

The student wrote the answer in 10 seconds: no steps, just intuition. They got it wrong because they skipped the intermediate steps and had no way to catch their own mistake.

Now the same student slows down and writes each step:

Q: A train leaves Delhi at 60 km/h. A second train leaves 2 hours later at 80 km/h
   on the same route. How long after the second train leaves does it catch up?

A: The first train travels for 2 hours before the second starts.
   Distance covered: 60 × 2 = 120 km
   When the second train starts, they're 120 km apart.
   Their relative speed: 80 - 60 = 20 km/h
   Time to close the gap: 120 / 20 = 6 hours
   So they meet 6 hours after the second train starts.

By writing out each step, the student catches logical gaps, verifies arithmetic, and arrives at the right answer. They can even spot if they made a mistake: “Wait, if the relative speed is 20 km/h, they should meet in 6 hours, not 4.”
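The student's written steps can be checked mechanically. The following is a minimal sketch, assuming both trains travel the same route in the same direction at constant speed; the function name `catch_up_time` is made up for this illustration.

```python
def catch_up_time(v_first: float, v_second: float, head_start_h: float) -> float:
    """Hours after the second train departs until it catches the first.

    Assumes the same route, the same direction, and constant speeds.
    """
    if v_second <= v_first:
        raise ValueError("the second train never catches up")
    gap_km = v_first * head_start_h      # 60 * 2 = 120 km head start
    closing_speed = v_second - v_first   # 80 - 60 = 20 km/h
    return gap_km / closing_speed        # 120 / 20 = 6 h


t = catch_up_time(60, 80, 2)
print(t)  # 6.0

# Cross-check: both trains are the same distance from Delhi at that moment.
assert 60 * (2 + t) == 80 * t  # 480 km each
```

Each line of the function corresponds to one line of the student's working, which is why an error in any single step is easy to locate.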

Chain-of-thought prompting asks language models for the same discipline: show your work. When the model generates intermediate reasoning steps, each step builds on the ones already written, so it can solve problems it would have bungled by jumping straight to the answer.


Read in This Order

Section | What You Will Learn | Difficulty | Time
01 · Context | Historical moment: why LLMs were failing at reasoning, despite being “intelligent” | 🟢 beginner | 8 min
02 · The Problem | Concrete examples of standard prompting failures | 🟢 beginner | 7 min
03 · The Idea | What CoT is, why it works, the intuition | 🟡 intermediate | 10 min
04 · The Math | Formal setup, few-shot notation, emergent-capability hypothesis | 🟡 intermediate | 8 min
05 · Worked Example | Step-by-step trace on a real GSM8K problem | 🟢 beginner | 6 min
06 · The Code | How to structure and run a CoT prompt | 🟡 intermediate | 8 min
07 · Limitations | When CoT fails, why, and what we don’t know | 🟡 intermediate | 6 min
08 · Impact | What changed: follow-ups, cascading innovations | 🟡 intermediate | 5 min
09 · Summary | One-sentence recap, glossary, further reading | 🟢 beginner | 3 min

Before You Read: Math You Need

No heavy mathematics in this paper — it’s about empirical discovery. But you should know:

  • Cross-entropy loss: How we measure whether a model is predicting the right token. See: Cross-Entropy Loss.
  • Few-shot learning: How we show examples in the prompt to teach the model. Covered in Paper 12 (GPT-3).

Architecture: Standard vs. Chain-of-Thought Prompting

STANDARD PROMPTING:
    Input Prompt
         |
    [Example 1: Q → A]
    [Example 2: Q → A]
         |
    [Test Question]
         |
    Language Model (decode)
         |
    Output: Answer (direct, often wrong)


CHAIN-OF-THOUGHT PROMPTING:
    Input Prompt
         |
    [Example 1: Q → Reasoning Chain → A]
    [Example 2: Q → Reasoning Chain → A]
         |
    [Test Question]
         |
    Language Model (decode reasoning)
         |
    Generate: Intermediate Steps
         |
    Language Model (continue decoding)
         |
    Output: Answer (after showing work)


KEY INSIGHT:
- Model size matters: CoT helps at roughly 100B+ parameters and gives little or no gain on smaller models
- The paper describes this as an emergent ability: a capability that shows up only at sufficient scale
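In the chain-of-thought flow above, the model's output contains the reasoning and the answer together, so scoring it requires pulling the final answer out of the generated text. Here is a minimal sketch of that step, assuming the chain ends with a phrase like "the answer is <number>" as in the toy examples on this page; `ANSWER_RE` and `extract_answer` are hypothetical names, and a real evaluation harness may parse answers differently.

```python
import re
from typing import Optional

# Assumed convention: the chain ends with "... the answer is <number>".
ANSWER_RE = re.compile(r"the answer is\s*\$?(-?\d[\d,]*(?:\.\d+)?)", re.IGNORECASE)


def extract_answer(generation: str) -> Optional[str]:
    """Return the last number following 'the answer is', or None if absent."""
    matches = ANSWER_RE.findall(generation)
    if not matches:
        return None
    # Take the last match and drop thousands separators such as "1,200".
    return matches[-1].replace(",", "")


sample = "First I add 12 + 8 = 20\nSo the answer is 20"
print(extract_answer(sample))
```

Taking the last match rather than the first is a deliberate choice in this sketch: if the model restates or revises an answer partway through its chain, the final statement is the one that gets scored.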
