Further Reading: Chain-of-Thought Prompting
Dive deeper into chain-of-thought reasoning and related work.
The Original Paper
Chain-of-Thought Prompting Elicits Reasoning in Large Language Models
Wei, Wang, Schuurmans, Bosma, Ichter, Xia, Chi, Le, Zhou
NeurIPS 2022 | January 2022
The foundational paper. Introduces CoT, shows that the benefit emerges at scale (around 100B parameters and above), and benchmarks on arithmetic (GSM8K, SVAMP, AQuA), commonsense (StrategyQA), and symbolic reasoning tasks. Essential reading.
Key Follow-Up Papers (Read These Next)
Zero-Shot CoT: Removing the Need for Examples
Large Language Models are Zero-Shot Reasoners
Kojima, Gu, Reid, Matsuo, Iwasawa
NeurIPS 2022 | May 2022
Key insight: You don’t need human-written reasoning examples. Simply adding “Let’s think step by step” to the prompt enables reasoning on large models.
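Results: On GSM8K, text-davinci-002 reached 40.7% with zero-shot CoT (vs. 10.4% with standard zero-shot prompting); on MultiArith, accuracy rose from 17.7% to 78.7%. Made CoT accessible to any task without manual example creation.
Why read it: Directly addresses the main practical limitation of CoT (needing good hand-written examples). Shows that the capability is strong enough for a single generic trigger phrase to elicit it, although the paper also finds that wording matters: misleading or irrelevant phrases do not help.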
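The two-stage prompting scheme described in the paper (first elicit reasoning, then extract the answer) can be sketched as follows. This is a minimal illustration, not the authors' code: `generate` is a hypothetical stand-in for any prompt-to-completion function, and the answer trigger shown is the one the paper uses for arithmetic tasks.

```python
from typing import Callable

REASONING_TRIGGER = "Let's think step by step."
ANSWER_TRIGGER = "Therefore, the answer (arabic numerals) is"


def build_reasoning_prompt(question: str) -> str:
    """Stage 1: ask for a reasoning chain with no worked examples."""
    return f"Q: {question}\nA: {REASONING_TRIGGER}"


def build_answer_prompt(question: str, reasoning: str) -> str:
    """Stage 2: append the generated reasoning and ask for the final answer."""
    return f"{build_reasoning_prompt(question)} {reasoning.strip()}\n{ANSWER_TRIGGER}"


def zero_shot_cot(question: str, generate: Callable[[str], str]) -> str:
    """Run both stages; `generate` is a hypothetical prompt -> completion function."""
    reasoning = generate(build_reasoning_prompt(question))
    return generate(build_answer_prompt(question, reasoning)).strip()
```

Keeping the two stages separate makes answer parsing much easier, at the cost of a second model call per question.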
Self-Consistency: Voting on Multiple Chains
Self-Consistency Improves Chain of Thought Reasoning in Language Models
Wang, Wei, Schuurmans, Le, Chi, Narang, Chowdhery, Zhou
ICLR 2023 | March 2022
Key insight: Instead of decoding one reasoning chain, sample several (the paper samples up to 40) and take a majority vote on the final answer. The diversity of reasoning paths compensates for individual errors.
Results: On GSM8K, self-consistency raised PaLM 540B from 56.5% to 74.4% (single greedy CoT vs. majority vote). Now standard practice for high-stakes reasoning.
Why read it: Shows how to buy accuracy with extra inference cost. Also demonstrates that reasoning chains aren’t deterministic: sampling at a nonzero temperature produces different reasoning paths, many of which reach the same correct answer.
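The voting step is simple enough to sketch in a few lines. The version below is an illustrative sketch under stated assumptions: `sample_chain` and `extract_answer` are hypothetical callables (a temperature-sampled model call and an answer parser), and the default of five samples is only for illustration.

```python
from collections import Counter
from typing import Callable, Optional


def self_consistency(
    prompt: str,
    sample_chain: Callable[[str], str],
    extract_answer: Callable[[str], Optional[str]],
    num_samples: int = 5,
) -> Optional[str]:
    """Sample several reasoning chains and return the most common final answer."""
    answers = []
    for _ in range(num_samples):
        chain = sample_chain(prompt)      # one sampled reasoning chain
        answer = extract_answer(chain)    # final answer parsed from that chain
        if answer is not None:
            answers.append(answer)
    if not answers:
        return None
    # Counter.most_common breaks ties by first-seen order.
    return Counter(answers).most_common(1)[0][0]
```

Chains whose answer cannot be parsed are dropped rather than counted, so a few malformed samples do not distort the vote.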
Code-as-Reasoning: Executable Reasoning
Program of Thoughts Prompting: Disentangling Computation from Reasoning for Numerical Reasoning Tasks
Chen, Ma, Wang, Cohen
TMLR 2023 | November 2022
Key insight: Instead of generating reasoning in natural language, generate Python code that solves the problem. Then execute the code to get the answer.
Results: Delegating arithmetic to an interpreter removes calculation errors, and the paper reports an average gain of roughly 12% over CoT across its evaluated datasets. A program that runs is not thereby correct, since it can still encode the wrong logic, but the executed steps are transparent and checkable.
Why read it: Addresses a common failure mode of CoT (arithmetic slips inside otherwise sound reasoning). Bridges reasoning and tools/computation.
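The execute-the-program step might look like the sketch below. It assumes, purely for illustration, that the model is prompted to store its result in a variable named `ans`; the `generated` string is a hand-written stand-in for model output, and `run_program_of_thought` is a hypothetical helper name.

```python
def run_program_of_thought(program: str, answer_var: str = "ans"):
    """Execute model-written Python and return the value bound to `answer_var`."""
    namespace: dict = {}
    # WARNING: exec runs arbitrary code; sandbox untrusted model output in practice.
    exec(program, namespace)
    if answer_var not in namespace:
        raise ValueError(f"program did not define {answer_var!r}")
    return namespace[answer_var]


# Hand-written stand-in for model output on the question:
# "A tray holds 12 eggs. There are 7 trays and 5 eggs are broken.
#  How many eggs are intact?"
generated = """
eggs_per_tray = 12
num_trays = 7
broken = 5
ans = eggs_per_tray * num_trays - broken
"""

print(run_program_of_thought(generated))  # prints 79
```

The interpreter guarantees the arithmetic, not the setup: if the program subtracts the wrong quantity, it will still run and return a wrong number.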
Tree-of-Thought: Branching Reasoning
Tree of Thoughts: Deliberate Problem Solving with Large Language Models
Yao, Yu, Zhao, Shafran, Griffiths, Cao, Narasimhan
NeurIPS 2023 | May 2023
Key insight: CoT explores one linear path. What if you explore multiple branches, like a search tree? Use tree search to find the best reasoning path.
Results: On Game of 24 (an arithmetic puzzle), ToT with GPT-4 solved 74% of tasks vs. 4% with standard CoT. Particularly strong on tasks with branching decision points.
Why read it: Extends CoT beyond linear chains to structured search. Paves the way for more sophisticated reasoning algorithms.
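To make the search idea concrete, here is a small breadth-first sketch on Game of 24. It is not the paper's implementation: in the paper a language model both proposes and evaluates thoughts, whereas here `propose` enumerates arithmetic moves exhaustively and `heuristic_value` is a crude stand-in evaluator. All function names are illustrative.

```python
from fractions import Fraction
from itertools import combinations
from typing import Callable, List, Optional, Tuple

State = Tuple[Fraction, ...]      # numbers still available
Path = Tuple[State, List[str]]    # (state, steps taken so far)


def propose(state: State) -> List[Tuple[State, str]]:
    """Thought generator: combine any two remaining numbers with one operation."""
    children = []
    for i, j in combinations(range(len(state)), 2):
        a, b = state[i], state[j]
        rest = tuple(x for k, x in enumerate(state) if k not in (i, j))
        candidates = [(a + b, f"{a} + {b}"), (a * b, f"{a} * {b}"),
                      (a - b, f"{a} - {b}"), (b - a, f"{b} - {a}")]
        if b != 0:
            candidates.append((a / b, f"{a} / {b}"))
        if a != 0:
            candidates.append((b / a, f"{b} / {a}"))
        for value, text in candidates:
            children.append((rest + (value,), f"{text} = {value}"))
    return children


def heuristic_value(state: State, target: Fraction) -> float:
    """Stand-in evaluator (an LM scores states in the paper): closer is better."""
    return -float(min(abs(x - target) for x in state))


def tree_of_thoughts_bfs(
    numbers: List[int],
    target: int = 24,
    beam_width: int = 5,
    evaluate: Callable[[State, Fraction], float] = heuristic_value,
) -> Optional[List[str]]:
    """Breadth-first search keeping only the `beam_width` best states per level."""
    goal = Fraction(target)
    frontier: List[Path] = [(tuple(Fraction(n) for n in numbers), [])]
    for _ in range(len(numbers) - 1):      # each step removes one number
        expanded = [(child, steps + [text])
                    for state, steps in frontier
                    for child, text in propose(state)]
        expanded.sort(key=lambda item: evaluate(item[0], goal), reverse=True)
        frontier = expanded[:beam_width]   # prune weak branches
    for state, steps in frontier:
        if state == (goal,):
            return steps
    return None
```

With a small `beam_width` the crude heuristic can prune the branch that leads to a solution; a large beam degenerates into exhaustive search. That trade-off between evaluator quality and search budget is the central design question the paper explores.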
Related Papers on Reasoning and Prompting
Least-to-Most Prompting: Decomposing Hard Problems
Least-to-Most Prompting Enables Complex Reasoning in Large Language Models
Zhou, Schärli, Hou, Wei, Scales, Wang, Schuurmans, Cui, Bousquet, Le, Chi
ICLR 2023
Key insight: First decompose a hard problem into a sequence of simpler sub-problems, then solve them in order, feeding each answer into the prompt for the next, harder one.
Why relevant: Another form of problem decomposition, complementary to CoT. Useful for compositional reasoning.
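The sequential-solving stage can be sketched as a loop that grows the prompt with each solved sub-question. In this sketch `decompose` and `solve` are hypothetical callables standing in for two model calls, and the `Q:`/`A:` layout is an assumption rather than the paper's exact template.

```python
from typing import Callable, List, Tuple


def least_to_most(
    question: str,
    decompose: Callable[[str], List[str]],
    solve: Callable[[str], str],
) -> List[Tuple[str, str]]:
    """Solve sub-questions in order, feeding earlier answers into later prompts."""
    solved: List[Tuple[str, str]] = []
    for subquestion in decompose(question):   # ordered easiest to hardest
        # Everything solved so far becomes context for the next sub-question.
        context = "".join(f"Q: {q}\nA: {a}\n" for q, a in solved)
        prompt = f"{question}\n{context}Q: {subquestion}\nA:"
        solved.append((subquestion, solve(prompt).strip()))
    return solved
```

The last pair in the returned list holds the answer to the final sub-question, which is typically the original question itself.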
Constitutional AI: Chain-of-Thought with Principles
Constitutional AI: Harmlessness from AI Feedback
Bai, Kadavath, Kundu, Askell, Kernion, Jones, Chen, et al.
arXiv | December 2022
Key insight: Replace human harmlessness labels with AI feedback guided by a written set of principles (a “constitution”). The model first critiques and revises its own responses, then is trained with RL against a preference model built from AI-generated comparisons; chain-of-thought reasoning is used to make those AI judgments more accurate and transparent.
Why relevant: Shows how CoT integrates with instruction-following and alignment (the next paper in this series, RLHF/InstructGPT).
Faithful Reasoning: Verifying that CoT is Real
Towards Faithful Reasoning in Large Language Models with Symbolic Planning and Grounding
Thawani, Prabhumoye, Deschamps
NeurIPS 2023 Workshop
Key insight: CoT reasoning is often unfaithful. Can we verify that reasoning actually led to the answer? Proposes grounding reasoning in symbolic logic.
Why relevant: Addresses the unfaithful reasoning limitation. Important for safety-critical applications.
Benchmarks and Datasets
GSM8K: Grade-School Math
Training Verifiers to Solve Math Word Problems
Cobbe, Kosaraju, Bavarian, Chen, Jun, Kaiser, Plappert, Tworek, Hilton, Nakano, Hesse, Schulman
arXiv | October 2021
The main arithmetic benchmark in the original CoT paper. About 8,500 grade-school math word problems, each with a step-by-step reference solution.
Access: GitHub: openai/grade-school-math
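Reference solutions in the dataset end with a line of the form `#### 72`, so scoring a model usually comes down to parsing that marker and comparing it with the last number in the model's output. The helpers below are a sketch of that convention; the function names are illustrative, and taking the last number in a completion is a common heuristic rather than an official scoring rule.

```python
import re
from typing import Optional


def extract_gsm8k_answer(solution: str) -> Optional[str]:
    """Return the number after the '####' marker in a reference solution."""
    match = re.search(r"####\s*(-?[\d,]*\.?\d+)", solution)
    if match is None:
        return None
    return match.group(1).replace(",", "")   # drop thousands separators


def last_number(text: str) -> Optional[str]:
    """Heuristic: treat the last number in a model completion as its answer."""
    numbers = re.findall(r"-?\d[\d,]*\.?\d*", text)
    if not numbers:
        return None
    return numbers[-1].replace(",", "").rstrip(".")


def is_correct(completion: str, reference_solution: str) -> bool:
    """Compare the parsed prediction and reference numerically."""
    predicted = last_number(completion)
    expected = extract_gsm8k_answer(reference_solution)
    if predicted is None or expected is None:
        return False
    return float(predicted) == float(expected)
```

Comparing as floats means `72` and `72.0` count as the same answer.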
MATH: Competition-Level Mathematics
Measuring Mathematical Problem Solving With the MATH Dataset
Hendrycks, Burns, Kadavath, Arora, Basart, Tang, Song, Steinhardt
NeurIPS 2021
12,500 problems drawn from high school mathematics competitions such as AMC 10, AMC 12, and AIME. Much harder than GSM8K. Used to test CoT on harder reasoning.
Access: GitHub: hendrycks/math
StrategyQA: Multi-Hop Reasoning
Did Aristotle Use a Laptop? A Question Answering Benchmark with Implicit Reasoning Strategies
Geva, Khashabi, Segal, Khot, Roth, Berant
TACL 2021
Yes/no questions whose required reasoning steps are implicit in the question and must be inferred by the solver. CoT helped on this benchmark because it forces explicit intermediate steps.
Access: GitHub: eladsegal/strategyqa
Blog Posts and Tutorials
Hugging Face: Chain-of-Thought Prompting
Hugging Face has excellent tutorials on prompt engineering, including detailed guides on chain-of-thought. Search “Chain-of-Thought Prompting” on huggingface.co.
Why read it: Practical implementation details, code examples, performance comparisons across models.
OpenAI Cookbook: Using Chain-of-Thought with GPT-4
OpenAI’s cookbook has examples of using CoT with their models (GPT-3.5, GPT-4).
Access: github.com/openai/openai-cookbook
Anthropic: Chain-of-Thought and Constitutional AI
Anthropic’s blog and papers on Constitutional AI explain how CoT is used in their alignment process.
Why read it: Shows how CoT integrates with RLHF and safety-focused reasoning.
Advanced Topics
Scaling Laws and Emergence
Emergent Abilities of Large Language Models
Wei, Tay, Bommasani, et al.
arXiv 2022
Comprehensive survey of emergent capabilities in LLMs, including reasoning. Positions CoT in the broader context of emergence.
Why read it: An empirical catalog of abilities that appear only at scale. It documents when reasoning emerges; it does not offer a settled theory of why.
Test-Time Compute
Scaling Laws for Transfer
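Hernandez, Kaplan, Henighan, McCandlish
arXiv 2021
Scaling-law background rather than a test-time compute study: it measures how much pre-training is worth when fine-tuning on a new task. CoT is a form of test-time compute, and later work (OpenAI o1, DeepSeek R1) pushes that idea much further.
Why read it: Background on how capability scales with compute and data, which is useful context for why models benefit from “thinking” (generating more tokens) at inference.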
What’s Coming Next (2025+)
Reasoning Models: o1 and Beyond
OpenAI o1 (previewed September 2024, released December 2024) and DeepSeek R1 (January 2025) represent the frontier: models that spend massive computation at inference time for reasoning.
These models directly extend the CoT insight:
- If thinking helps, allocate more compute for thinking
- Use RL to train models to reason effectively at test time
- Report 90%+ on MATH-style benchmarks (DeepSeek R1 reports 97.3% on MATH-500)
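The DeepSeek-R1 technical report appeared on arXiv in January 2025; for o1, OpenAI has published a system card and blog posts rather than a full paper. Follow OpenAI and DeepSeek’s research pages.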
Quick Reference: The CoT Ecosystem (2022–2025)
2022 Jan: Chain-of-Thought Prompting (Wei et al.) ← You are here
↓
2022 Mar: Self-Consistency (Wang et al.)
↓
2022 May: Zero-Shot CoT (Kojima et al.)
↓
2022 May: Least-to-Most Prompting (Zhou et al.)
↓
2022 Nov: Program-of-Thoughts (Chen et al.)
↓
2022 Dec: Constitutional AI (Bai et al.)
↓
2023 May: Tree-of-Thoughts (Yao et al.)
↓
2024+: Reasoning Models (o1, R1): massive test-time compute
Key Papers to Read in Order
- This paper: Chain-of-Thought Prompting — Foundation
- Zero-Shot CoT — Remove examples requirement
- Self-Consistency — Improve accuracy via voting
- Program-of-Thoughts — Code as reasoning
- Tree-of-Thoughts — Search over reasoning paths
Then read the next paper in this series: Paper 15: RLHF / InstructGPT — How CoT integrates with instruction-following.
Code Implementations
Official: Google Research
The original authors’ code repository:
github.com/google-research/google-research/tree/master/chain_of_thought
Includes evaluation scripts and prompt templates.
Hugging Face Transformers
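Most examples work with the Transformers library:
from transformers import AutoTokenizer, AutoModelForCausalLM

# "gpt2" keeps the example small; it is far below the scale where CoT helps.
tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

prompt = "Q: ...\nA: Let's think step by step."
inputs = tokenizer(prompt, return_tensors="pt")
# max_new_tokens bounds only the generated text, unlike max_length.
outputs = model.generate(**inputs, max_new_tokens=200, pad_token_id=tokenizer.eos_token_id)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))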
API-Based: OpenAI, Anthropic
Most modern LLM APIs support CoT out of the box. Include worked reasoning examples in the prompt (few-shot), or simply instruct the model to reason step by step before answering (zero-shot).
Tools and Extensions
Prompt Caching for CoT
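Since CoT prompts are longer, prompt caching (reusing a previously processed prompt prefix) can reduce cost and latency. Both OpenAI and Anthropic offer prompt caching, which suits a fixed block of few-shot CoT examples placed at the start of the prompt.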