Worked Example: Step-by-Step Trace on a Real Problem
Let’s walk through a chain-of-thought (CoT) prompting example end to end, using a grade-school math word problem in the style of the GSM8K benchmark.
The Setup
We’re using PaLM (540B parameters), a model large enough for CoT to help dramatically; the paper reports that the gains appear mainly at large model scale.
We’ll use three few-shot examples to establish the pattern, then test on a new problem.
Example 1: Simple One-Step Problem
Few-shot Example 1:
Q: There are 15 trees in the garden. A storm knocks down 6 trees.
How many trees are left standing?
A: I need to find how many trees remain after the storm.
Starting number of trees: 15
Trees knocked down: 6
Trees remaining = 15 - 6 = 9
Answer: 9 trees
Notice: we’re showing the model how to:
- Restate the goal
- Identify relevant numbers
- Apply the operation
- State the final answer with its unit
Example 2: Multi-Step with Two Operations
Few-shot Example 2:
Q: Sarah has 12 apples. She buys 8 more apples from the market.
Then she gives 5 apples to her friend. How many apples does
she have now?
A: I need to track Sarah's apples through two transactions.
Starting apples: 12
Apples bought: 8
After buying: 12 + 8 = 20
Apples given to friend: 5
After giving: 20 - 5 = 15
Answer: 15 apples
Pattern established:
- State the problem
- Show each calculation with intermediate results
- Chain the steps together
- Give the final answer
Example 3: Multi-Step with Multiplication
Few-shot Example 3:
Q: A baker makes 3 dozen cookies. Each dozen has 12 cookies.
He sells 20 cookies. How many cookies does he have left?
A: I need to find total cookies, then subtract sold cookies.
Dozens made: 3
Cookies per dozen: 12
Total cookies made: 3 × 12 = 36 cookies
Cookies sold: 20
Cookies remaining: 36 - 20 = 16 cookies
Answer: 16 cookies
Now we’ve shown three different patterns:
- Subtraction (Example 1)
- Addition + Subtraction (Example 2)
- Multiplication + Subtraction (Example 3)
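The three exemplars above can be assembled into a single few-shot prompt. The following is a minimal sketch, not the paper's code: the names `EXEMPLARS` and `build_cot_prompt` are hypothetical, and the `Q:` / `A:` layout simply mirrors the examples shown here.

```python
# Few-shot exemplars copied from Examples 1-3: (question, worked answer).
EXEMPLARS = [
    (
        "There are 15 trees in the garden. A storm knocks down 6 trees. "
        "How many trees are left standing?",
        "I need to find how many trees remain after the storm.\n"
        "Starting number of trees: 15\n"
        "Trees knocked down: 6\n"
        "Trees remaining = 15 - 6 = 9\n"
        "Answer: 9 trees",
    ),
    (
        "Sarah has 12 apples. She buys 8 more apples from the market. "
        "Then she gives 5 apples to her friend. How many apples does she have now?",
        "I need to track Sarah's apples through two transactions.\n"
        "Starting apples: 12\n"
        "Apples bought: 8\n"
        "After buying: 12 + 8 = 20\n"
        "Apples given to friend: 5\n"
        "After giving: 20 - 5 = 15\n"
        "Answer: 15 apples",
    ),
    (
        "A baker makes 3 dozen cookies. Each dozen has 12 cookies. "
        "He sells 20 cookies. How many cookies does he have left?",
        "I need to find total cookies, then subtract sold cookies.\n"
        "Dozens made: 3\n"
        "Cookies per dozen: 12\n"
        "Total cookies made: 3 × 12 = 36 cookies\n"
        "Cookies sold: 20\n"
        "Cookies remaining: 36 - 20 = 16 cookies\n"
        "Answer: 16 cookies",
    ),
]


def build_cot_prompt(exemplars, question):
    """Join the worked Q/A exemplars, then append the new question with an open 'A:'."""
    blocks = [f"Q: {q}\nA: {a}" for q, a in exemplars]
    blocks.append(f"Q: {question}\nA:")  # the model continues from here
    return "\n\n".join(blocks)
```

Ending the prompt with a bare `A:` invites the model to continue in the same step-by-step format as the exemplars.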
Test Question (New Problem)
Now we present a problem the model hasn’t seen before:
Q: A store orders 4 cases of juice bottles. Each case contains
24 bottles. The store already had 30 bottles in stock.
After receiving the new cases, they sell 50 bottles.
How many bottles does the store have now?
Model’s Response (Chain-of-Thought)
Here’s what PaLM (540B) generates with CoT prompting:
A: I need to find the total juice bottles after ordering and selling.
Cases ordered: 4
Bottles per case: 24
Bottles from new cases: 4 × 24 = 96 bottles
Bottles already in stock: 30
Total bottles before selling: 96 + 30 = 126 bottles
Bottles sold: 50
Bottles remaining: 126 - 50 = 76 bottles
Answer: 76 bottles
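To score a response like this automatically, the final number has to be pulled out of the generated text. Here is one possible sketch, assuming the model follows the exemplars and ends with a line starting with `Answer:`; the helper name `extract_answer` is hypothetical and this is not the paper's evaluation code.

```python
import re


def extract_answer(response):
    """Return the number on the last 'Answer:' line, or None if there is none."""
    matches = re.findall(
        r"^Answer:\s*(-?\d+(?:\.\d+)?)", response, flags=re.MULTILINE
    )
    return float(matches[-1]) if matches else None


response = """I need to find the total juice bottles after ordering and selling.
Cases ordered: 4
Bottles per case: 24
Bottles from new cases: 4 × 24 = 96 bottles
Bottles already in stock: 30
Total bottles before selling: 96 + 30 = 126 bottles
Bottles sold: 50
Bottles remaining: 126 - 50 = 76 bottles
Answer: 76 bottles"""

print(extract_answer(response))  # 76.0
```

Taking the last match rather than the first guards against intermediate lines that happen to start with `Answer:`.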
Verification (Step-by-Step Arithmetic)
Let’s verify this by hand:
Step 1: Calculate bottles from new cases $$4 \times 24 = 96 \text{ bottles}$$
Check: $4 \times 20 = 80$, $4 \times 4 = 16$, so $80 + 16 = 96$. ✓
Step 2: Add bottles already in stock $$96 + 30 = 126 \text{ bottles}$$
Check: $96 + 30 = 126$. ✓
Step 3: Subtract bottles sold $$126 - 50 = 76 \text{ bottles}$$
Check: $126 - 50 = 76$. ✓
Final Answer: 76 bottles ✓
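The same check can be done in a few lines of code. This is just the arithmetic from the three steps above; the function name `bottles_remaining` is illustrative.

```python
def bottles_remaining(cases, per_case, in_stock, sold):
    """Return the three intermediate results from the worked solution."""
    from_new_cases = cases * per_case            # Step 1
    before_selling = from_new_cases + in_stock   # Step 2
    remaining = before_selling - sold            # Step 3
    return from_new_cases, before_selling, remaining


steps = bottles_remaining(cases=4, per_case=24, in_stock=30, sold=50)
assert steps == (96, 126, 76)
print(steps)  # (96, 126, 76)
```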
What Happens Without CoT?
Same test question, but using standard prompting (no reasoning steps shown in few-shot examples):
Q: A store orders 4 cases of juice bottles. Each case contains
24 bottles. The store already had 30 bottles in stock.
After receiving the new cases, they sell 50 bottles.
How many bottles does the store have now?
A: 76 bottles
Wait, that’s correct! So why does the paper say standard prompting fails?
The answer: standard prompting can get an individual problem like this one right, but it fails on most of the benchmark, especially on harder problems.
The paper reports that PaLM 540B reaches only 17.9% accuracy on GSM8K with standard prompting. That means roughly 4 out of 5 problems fail. This problem may simply be one of the minority it gets right.
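For reference, the standard-prompting baseline differs from the CoT prompt only in what the few-shot answers contain: the reasoning lines are dropped and only the final answer is kept. A rough sketch of that conversion follows; the helper names `to_standard_answer` and `build_standard_prompt` are hypothetical, and it assumes each worked answer ends with an `Answer:` line as in the exemplars above.

```python
def to_standard_answer(cot_answer):
    """Drop the reasoning and keep only the text after 'Answer:' on the last line."""
    last_line = cot_answer.strip().splitlines()[-1]
    if last_line.startswith("Answer:"):
        last_line = last_line[len("Answer:"):]
    return last_line.strip()


def build_standard_prompt(exemplars, question):
    """Build a few-shot prompt whose exemplar answers contain no reasoning steps."""
    blocks = [f"Q: {q}\nA: {to_standard_answer(a)}" for q, a in exemplars]
    blocks.append(f"Q: {question}\nA:")
    return "\n\n".join(blocks)


# One exemplar (Example 1) is enough to show the difference.
exemplars = [
    (
        "There are 15 trees in the garden. A storm knocks down 6 trees. "
        "How many trees are left standing?",
        "I need to find how many trees remain after the storm.\n"
        "Starting number of trees: 15\n"
        "Trees knocked down: 6\n"
        "Trees remaining = 15 - 6 = 9\n"
        "Answer: 9 trees",
    ),
]

print(build_standard_prompt(exemplars, "<new question here>"))
```

Keeping the questions identical across the two prompts isolates the effect of the reasoning steps themselves.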
A Problem Where Standard Prompting Fails
Let’s try a harder problem that’s more likely to stump standard prompting:
Q: In a school, there are 240 students. In the morning, 1/3 of the
students attend math class. Of those in math class, 1/4 are in
grade 10. After math class, 2/5 of the original 240 students
attend English class. No student attends both classes. How many
students attend either math or English class?
Standard Prompting Output (incorrect):
A: 80 students
Chain-of-Thought Output (correct):
A: I need to find students in math OR English, not both.
Math class:
- Students in math: 1/3 × 240 = 80 students
- (Note: 1/4 of 80 = 20 in grade 10, but we don't need this)
English class:
- Students in English: 2/5 × 240 = 96 students
Students in math only: 80 students
Students in English only: 96 students
Students in either (not both): 80 + 96 = 176 students
Answer: 176 students
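The fractions in this trace can be checked exactly with Python's `fractions` module. Note the assumption the worked answer relies on: no student attends both classes, so the two groups can simply be added. If the classes overlapped, the overlap would have to be known and subtracted.

```python
from fractions import Fraction

total_students = 240

math_students = Fraction(1, 3) * total_students     # 80
english_students = Fraction(2, 5) * total_students  # 96
grade10_in_math = Fraction(1, 4) * math_students    # 20, a distractor not used below

# Assumption: the math and English groups are disjoint (no student is in both).
either_class = math_students + english_students

assert math_students == 80
assert english_students == 96
print(either_class)  # 176
```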
Why CoT works here:
- The model writes “1/3 × 240 = 80” — this intermediate calculation anchors the reasoning
- The model writes “Students in either (not both)” — this reminds it of the condition
- The model chains: 80 + 96 = 176
Why standard prompting fails:
- The model sees “240 students,” “1/3,” and “2/5” and pattern-matches to a number
- It might output 80 (just the first calculation)
- It might output an unrelated number such as 60
- It has no written intermediate results to build on, so it doesn’t reliably execute the multi-step logic
Key Insight: Robustness
Chain-of-thought prompting is more robust on harder problems because it makes the model write out each intermediate step before committing to an answer. Standard prompting can get easy problems right, but it becomes unreliable as the number of steps grows.
That’s why the paper’s main result is so significant:
- PaLM 540B with standard prompting: 17.9% on GSM8K
- PaLM 540B with chain-of-thought: 56.9% on GSM8K
That is roughly a 3× improvement, and the paper reports that the gains are larger on more complicated, multi-step problems.