Impact: What Changed After This Paper
The Chain-of-Thought (CoT) paper (Wei et al.) appeared in January 2022. Within months, follow-up work had extended it, and CoT became a default technique for reasoning tasks in large language models. Here’s what changed:
Immediate Follow-Ups (2022–2023)
Zero-Shot CoT (Kojima et al., May 2022)
The problem: You need human-written reasoning examples for standard CoT. What if you don’t have them?
The solution: Just add “Let’s think step by step” to your prompt.
Standard zero-shot:
Q: A store has 50 books. They receive 30 more. They sell 20. How many books are left?
A: ___
Zero-shot CoT:
Q: A store has 50 books. They receive 30 more. They sell 20. How many books are left?
A: Let's think step by step. ___
Result: On GSM8K, zero-shot CoT reached 40.7% accuracy with GPT-3 (text-davinci-002) and no examples, compared to 10.4% with standard zero-shot prompting.
Impact: Made CoT usable on new tasks and domains without hand-crafting examples.
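The two prompt formats above can be sketched as small string builders. This is a minimal illustration, not the authors' code: the function names are hypothetical, and the answer-extraction phrase is an assumption modeled on the two-stage prompting style the paper describes (first elicit reasoning, then ask for the final answer).

```python
COT_TRIGGER = "Let's think step by step."
# Assumed phrasing for the second (answer-extraction) stage on arithmetic tasks.
ANSWER_TRIGGER = "Therefore, the answer (arabic numerals) is"


def standard_zero_shot(question: str) -> str:
    """Plain zero-shot prompt: the model must answer immediately."""
    return f"Q: {question}\nA:"


def zero_shot_cot(question: str, trigger: str = COT_TRIGGER) -> str:
    """Zero-shot CoT prompt: the trigger phrase opens the answer."""
    return f"Q: {question}\nA: {trigger}"


def answer_extraction_prompt(question: str, reasoning: str) -> str:
    """Second stage: append the model's reasoning, then ask for the answer."""
    return f"{zero_shot_cot(question)} {reasoning.strip()}\n{ANSWER_TRIGGER}"


question = ("A store has 50 books. They receive 30 more. They sell 20. "
            "How many books are left?")
print(standard_zero_shot(question))
print(zero_shot_cot(question))
```

In practice the text returned for the `zero_shot_cot` prompt would be passed as `reasoning` to `answer_extraction_prompt`, and the model's short completion parsed as the final answer.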
Self-Consistency (Wang et al., Mar 2022)
The problem: A single reasoning chain might be wrong. What if you sample multiple chains?
The solution: Sample K different reasoning chains (with a nonzero temperature), extract the final answer from each, and take the majority vote.
Sample 1: Reasoning → Answer: 60
Sample 2: Reasoning → Answer: 60
Sample 3: Reasoning → Answer: 65
Sample 4: Reasoning → Answer: 60
Sample 5: Reasoning → Answer: 60
Final answer (majority): 60
Result: On GSM8K with PaLM 540B, self-consistency reached 74.4% (vs. 56.5% for a single greedy CoT chain).
Impact: Pushed accuracy higher at the price of K times the inference compute; a common choice when accuracy matters more than latency or cost.
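The voting step can be sketched in a few lines. This is a minimal illustration under stated assumptions: the chains are hard-coded strings standing in for sampled model outputs, and `extract_answer` (a hypothetical helper) simply takes the last number in each chain.

```python
import re
from collections import Counter
from typing import Optional


def extract_answer(chain: str) -> Optional[str]:
    """Return the last number mentioned in a reasoning chain, or None."""
    numbers = re.findall(r"-?\d+(?:\.\d+)?", chain)
    return numbers[-1] if numbers else None


def self_consistency(chains: list[str]) -> Optional[str]:
    """Majority vote over the answers extracted from sampled chains."""
    answers = [a for a in map(extract_answer, chains) if a is not None]
    if not answers:
        return None
    return Counter(answers).most_common(1)[0][0]


# Stand-ins for five sampled chains; one contains an arithmetic slip.
chains = [
    "50 + 30 = 80, then 80 - 20 = 60. Answer: 60",
    "After the delivery there are 80 books. Selling 20 leaves 60.",
    "50 + 30 = 85, then 85 - 20 = 65. Answer: 65",
    "80 books minus 20 sold gives 60.",
    "50 - 20 = 30, plus 30 received is 60.",
]
print(self_consistency(chains))  # 60
```

The vote is over final answers only, so chains that reach the same answer by different routes reinforce each other.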
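Program-Aided Language Models (PAL, Gao et al., 2022) / Program-of-Thoughts (PoT, Chen et al., 2022)
The problem: Language models often set up a problem correctly but slip on the arithmetic when they compute in natural language. They are also good at generating code. What if an interpreter does the computing instead?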
The solution: Generate Python code to solve the problem, then execute it.
Q: A store has 50 books. They receive 30 more. They sell 20. How many books are left?
A:
starting_books = 50
received = 30
sold = 20
final = starting_books + received - sold
print(final)
# Output: 60
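Result: Removed arithmetic errors from the final computation, because the interpreter does the math. The program can still encode the wrong logic, so code that runs is not guaranteed to give the correct answer.
Impact: The same idea, delegating computation to an external tool, appears in production systems (e.g., code interpreters, Wolfram|Alpha integration, tool-using agents).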
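The "generate, then execute" step can be sketched as follows. This is an illustration only: `run_generated_program` is a hypothetical helper, the generated program is hard-coded rather than produced by a model, and emptying `__builtins__` is a convenience, not a security sandbox. Real systems run untrusted code in an isolated process or container.

```python
def run_generated_program(code: str, result_var: str = "final"):
    """Execute model-generated code and return the value of `result_var`.

    Assumption: the prompt instructs the model to store its answer in a
    variable with a known name. Builtins are removed so the snippet cannot
    call open() or import modules, but this is NOT a real sandbox.
    """
    namespace: dict = {"__builtins__": {}}
    exec(code, namespace)
    return namespace[result_var]


# Stand-in for the program a model would generate for the bookstore question.
generated = """
starting_books = 50
received = 30
sold = 20
final = starting_books + received - sold
"""
print(run_generated_program(generated))  # 60

# A program with the wrong logic also runs without error.
buggy = """
starting_books = 50
received = 30
sold = 20
final = starting_books - received - sold
"""
print(run_generated_program(buggy))  # 0
```

The second call shows the limit of the approach: the interpreter guarantees the arithmetic, not the translation of the problem into code.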
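Tree of Thoughts (ToT) (Yao et al., May 2023)
The problem: A linear chain of thought commits to one path. What if the model could explore a tree of alternative paths?
The solution: Use a tree search algorithm (e.g., breadth-first or depth-first search) in which the model proposes several next steps, evaluates how promising each partial solution is, and abandons or backtracks from dead ends.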
Q: Solve a complex puzzle
Root (question)
/ | \
Path1 Path2 Path3
/ | \
Goal Dead-end Continue...
Result: Much better performance on tasks that require search (e.g., 74% on Game of 24 with GPT-4 vs. 4% with CoT prompting).
Impact: Shifted thinking from linear to branching reasoning; paved the way for search-based planning.
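The search loop can be sketched as a breadth-first search with a fixed beam. This is a toy illustration, not the paper's implementation: in ToT a language model both proposes the next thoughts and scores partial solutions, whereas here `expand` and `score` are simple deterministic functions on a made-up number puzzle (reach 14 from 1 using "+3" and "*2"). All names are hypothetical.

```python
from typing import Callable, Optional

# A state is (current value, steps taken so far).
State = tuple[int, tuple[str, ...]]


def tree_search(root: State,
                expand: Callable[[State], list[State]],
                score: Callable[[State], float],
                is_goal: Callable[[State], bool],
                beam_width: int = 2,
                max_depth: int = 5) -> Optional[State]:
    """Breadth-first search that keeps only the best `beam_width` states."""
    frontier = [root]
    for _ in range(max_depth):
        # Propose continuations for every state kept at this level.
        candidates = [child for state in frontier for child in expand(state)]
        for candidate in candidates:
            if is_goal(candidate):
                return candidate
        # Evaluate and prune: keep the most promising partial solutions.
        frontier = sorted(candidates, key=score, reverse=True)[:beam_width]
        if not frontier:
            return None
    return None


TARGET = 14


def expand(state: State) -> list[State]:
    value, steps = state
    return [(value + 3, steps + ("+3",)), (value * 2, steps + ("*2",))]


def score(state: State) -> float:
    return -abs(TARGET - state[0])  # closer to the target is better


def is_goal(state: State) -> bool:
    return state[0] == TARGET


print(tree_search((1, ()), expand, score, is_goal, beam_width=2))
# (14, ('+3', '+3', '*2'))
```

With `beam_width=1` the same search follows only the single best-looking state at each level, which is the analogue of a linear chain, and on this puzzle it fails to find a solution within five steps.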
Adoption in Production Systems
ChatGPT and Claude
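Assistants such as ChatGPT (OpenAI) and Claude (Anthropic) routinely produce chain-of-thought style output, although their training details are only partly public:
- They typically write out reasoning steps before answering complex questions
- Anthropic's Constitutional AI work reports using chain-of-thought style reasoning when the model critiques and compares responses during training
- Showing reasoning to users helps transparency, though the stated reasoning is not guaranteed to match what the model actually computed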
Reasoning Models: OpenAI o1 and DeepSeek R1
In late 2024–2025, the “reasoning models” era emerged:
OpenAI o1 (preview in September 2024, full release in December 2024):
- Explicitly allocates test-time compute to reasoning
- Generates extended internal reasoning chains before producing outputs
- Builds on CoT, but the model is trained with reinforcement learning to produce long reasoning chains rather than being prompted for them
DeepSeek R1 (January 2025):
- Similar approach: long reasoning chains, then answers
- Open-weight alternative to o1
Both treat reasoning as a first-class objective of training and inference, not just a prompting technique.
Theoretical Insights
CoT revealed important properties of large language models:
1. Emergent Reasoning
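- In the original experiments, CoT prompting helped only at roughly 100B parameters and above; smaller models produced fluent but illogical chains
- This led to renewed interest in scaling laws and emergent capabilities
- Similar scale-dependent patterns were later reported in code generation, instruction-following, etc.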
2. In-Context Learning
- CoT demonstrated that models can pick up a multi-step procedure from a few examples, not just an input-output pattern
- This fed into more sophisticated in-context learning and prompt engineering research, and CoT is now often combined with retrieval-augmented generation
3. Decoupling Generation from Computation
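- CoT showed that showing work (generating intermediate steps) improves results: every generated token gives the model extra computation before it commits to an answer
- This insight is central to recent work on test-time compute (spending more inference time on hard problems)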
Cascading Research
Many subsequent papers built on CoT:
| Paper | Contribution |
|---|---|
| Least-to-Most Prompting | Decompose a problem into sub-problems and solve them in order, from easiest to hardest |
| Decompose-Then-Integrate | Break complex tasks into parts, integrate results |
| Faithful CoT Explanation | Verify that reasoning actually matches the model’s computation |
| Automatic CoT (Auto-CoT) | Generate CoT examples automatically instead of writing them by hand |
| RLHF with CoT | Train reward models that prefer step-by-step reasoning |
| Chain-of-Code | Alternate between natural language and code for reasoning |
Impact on InstructGPT and RLHF (Paper 15)
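InstructGPT was developed at about the same time as CoT (both papers date from early 2022), so a direct influence on its training is not established. The connection is clearer for the instruction-following models that came later:
- Instruction-tuning data began to include step-by-step solutions
- Reward models in RLHF can favor outputs that include a reasoning chain, which incentivizes the model to explain its thinking
- Explaining the reasoning before the answer became common behavior for instruction-following models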
Key Insight: Cost vs. Benefit
CoT’s impact stems from a fundamental insight:
With no training (just prompt engineering), you get:
- Up to roughly 3× higher accuracy on hard reasoning benchmarks with the largest models (e.g., PaLM 540B on GSM8K: 17.9% to 56.9%)
- Better explainability (users see reasoning)
- Better debugging (easier to catch errors)
For a small cost:
- Longer prompts and outputs (few-shot examples plus generated reasoning, so higher token cost)
- Higher latency (more tokens to generate)
This favorable trade-off made CoT ubiquitous.
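The trade-off can be made concrete with back-of-the-envelope arithmetic. The sketch below is purely illustrative: the token counts and per-token prices are made-up assumptions, not measurements or any provider's real pricing, and `request_cost` is a hypothetical helper.

```python
def request_cost(prompt_tokens: int, output_tokens: int,
                 price_in: float, price_out: float,
                 samples: int = 1) -> float:
    """Cost in dollars, with prices given per one million tokens.

    Conservative assumption: the prompt is billed once per sample.
    """
    per_sample = prompt_tokens * price_in + output_tokens * price_out
    return samples * per_sample / 1_000_000


# Hypothetical prices and token counts, chosen only for illustration.
PRICE_IN, PRICE_OUT = 1.0, 2.0

direct = request_cost(100, 5, PRICE_IN, PRICE_OUT)           # answer only
cot = request_cost(100, 120, PRICE_IN, PRICE_OUT)            # reasoning + answer
voted = request_cost(100, 120, PRICE_IN, PRICE_OUT, samples=5)  # 5 sampled chains

print(f"CoT costs {cot / direct:.1f}x the direct prompt")
print(f"5-sample voting costs {voted / cot:.1f}x a single CoT chain")
```

Under these assumptions most of the extra cost comes from the generated reasoning tokens, and sampling several chains multiplies it further, which is why the "small cost" claim depends on how long the chains are and how many are sampled.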
Long-Term Legacy
Today, in 2025:
- CoT is foundational knowledge for any LLM engineer
- Zero-shot CoT (“Let’s think step by step”) is a basic prompt engineering technique
- Self-consistency is standard for high-stakes tasks
- Reasoning-focused models (o1, R1) are the frontier of AI capability
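Without this paper: Step-by-step prompting would probably have been discovered eventually, but the field might have spent longer optimizing prompt templates that ask for answers directly.
With this paper: We learned that reasoning can be elicited from large models, that it emerges with scale, and that it improves dramatically through simple prompting. That insight shaped much of the roadmap of LLM development through 2025.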