Constitutional AI: Harmlessness from AI Feedback
What This Paper Did
RLHF (Paper 15) makes models helpful and honest by training a reward model on thousands of human preference labels, but obtaining those labels is expensive, slow, and psychologically taxing for the people who must judge harmful content. Constitutional AI (CAI) replaces human harmlessness labels with AI feedback: write a constitution (a list of 16–18 principles for how the AI should behave), use an AI to critique and revise model outputs against that constitution, and use the AI’s preferences (not humans’) as the harmlessness signal for training the reward model. The approach has two stages:
- SL-CAI (Supervised Learning): Sample a response to a harmful prompt, ask the model to critique it against a principle drawn from the constitution and then revise it, repeat the critique-revision step as needed, and fine-tune the model on the final self-corrected responses.
- RL-CAI (Reinforcement Learning): Generate pairs of model outputs, ask the AI (not a human) to judge which one better follows the constitution, train a reward model on these AI-generated preferences, then use PPO to optimize the model against the constitutional reward model.
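The SL-CAI stage can be sketched as a short loop. This is a minimal illustration under stated assumptions, not the paper's code: `toy_model`, `critique_and_revise`, the two-entry `CONSTITUTION`, and all prompt wording are hypothetical stand-ins chosen so the example runs offline.

```python
import random
from typing import Callable, Dict, List

# Hypothetical two-principle constitution; a real one is longer.
CONSTITUTION: List[str] = [
    "Identify ways the response is harmful, unethical, or illegal.",
    "Identify ways the response encourages dangerous behaviour.",
]

def critique_and_revise(
    model: Callable[[str], str],
    prompt: str,
    n_rounds: int = 2,
    seed: int = 0,
) -> Dict[str, str]:
    """Return one supervised example: the prompt and its final revision."""
    rng = random.Random(seed)
    response = model(f"Human: {prompt}\nAssistant:")
    for _ in range(n_rounds):
        principle = rng.choice(CONSTITUTION)  # one sampled principle per round
        critique = model(
            f"Response: {response}\nCritique request: {principle}\nCritique:"
        )
        response = model(
            f"Response: {response}\nCritique: {critique}\n"
            "Revision request: Rewrite the response to address the critique.\nRevision:"
        )
    return {"prompt": prompt, "completion": response}

def toy_model(text: str) -> str:
    """Stand-in for a language-model call (assumption, not a real API)."""
    if text.endswith("Critique:"):
        return "The response gives unsafe instructions."
    if text.endswith("Revision:"):
        return "I can't help with that, but here is some safety information."
    return "Sure, here is how to do it."

example = critique_and_revise(toy_model, "How do I pick a lock?")
print(example)
```

Only the final revision is kept as the fine-tuning target; the intermediate critiques are scaffolding that produces it.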
Result: Claude models trained with Constitutional AI are intended to be both helpful and harmless, with the harmlessness signal supplied by an AI judge rather than burned-out human reviewers. Because the feedback is AI-generated, the dataset can grow with available compute instead of annotator hours.
RLHF bottleneck:
1000 harmful examples → 1000 human annotations → expensive, slow, psychologically taxing
Constitutional AI:
1000 harmful examples → AI critiques all 1000 in parallel → SL-CAI + RL-CAI (RLAIF) → no human burnout
Key equations:
- Preference probability: P(y_w ≻ y_l | x) = σ(r(x, y_w) - r(x, y_l)), where r(x, y) is the scalar score the reward model assigns to response y for prompt x
- Bradley-Terry loss: L = -E[log σ(r(x, y_w) - r(x, y_l))]
- Same as RLHF but with AI-generated preferences instead of human labels
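A small numeric check can make the Bradley-Terry loss above concrete. The sketch below uses only the standard library; `bradley_terry_loss` and the reward values are illustrative assumptions, not figures from the paper.

```python
import math
from typing import List, Tuple

def sigmoid(z: float) -> float:
    """Logistic function σ(z)."""
    return 1.0 / (1.0 + math.exp(-z))

def bradley_terry_loss(pairs: List[Tuple[float, float]]) -> float:
    """Mean of -log σ(r_w - r_l) over (r_w, r_l) reward pairs."""
    return sum(-math.log(sigmoid(r_w - r_l)) for r_w, r_l in pairs) / len(pairs)

# (reward of preferred response, reward of rejected response)
tied = bradley_terry_loss([(0.0, 0.0)])     # model cannot tell them apart
correct = bradley_terry_loss([(2.0, 0.0)])  # preferred response scored higher
wrong = bradley_terry_loss([(0.0, 2.0)])    # preferred response scored lower

print(f"tied={tied:.4f} correct={correct:.4f} wrong={wrong:.4f}")
```

A tie gives a loss of ln 2 (about 0.6931). Ranking the preferred response higher pushes the loss toward zero, and ranking it lower is penalised roughly linearly in the size of the mistake.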
The Indian Analogy
Imagine you run a boarding school and need to enforce a code of conduct. The old way (RLHF): hire 1000 teachers to stand in hallways and say “No, that violates Rule 3” every time a student breaks a rule. The teachers get exhausted from constant conflict, and some start doubting themselves.
The new way (Constitutional AI): write down your code of conduct on a poster, then hire one very smart senior prefect who reads the rules every morning and spends their day asking younger students, “Did that action violate Rule 2? Why or why not? How would you rewrite the situation to follow the rules?” The senior prefect generates feedback automatically for every situation, following the written rules. The younger students learn by revising their behaviour based on the prefect’s logical critique — not arbitrary authority.
The constitution is transparent and auditable. Anyone can read the rules and see if the prefect is applying them fairly. The rules don’t change based on the mood of a human reviewer.
Comparison: RLHF vs. Constitutional AI
| Aspect | RLHF (Paper 15) | Constitutional AI |
|---|---|---|
| Preference source | Thousands of human annotators | An AI feedback model (the model itself or a separate copy) |
| Scaling | Linear in human effort; bottleneck | Scales with compute; no human-labelling bottleneck for harmlessness |
| Bias | Reflects human biases (culture, mood, disagreement) | Reflects AI training data biases; consistent application |
| Auditability | Implicit (hard to know why humans chose A over B) | Explicit (constitution is written and readable) |
| Speed | Slow (humans are slow) | Fast (AI is fast) |
| Psychological burden | Humans judge harmful content; burnout risk | No humans judge harmful content directly |
| Generality | Task-specific (need humans for each task) | Generalizes via constitution principles |
Read in This Order
| Section | What You Will Learn | Difficulty | Time |
|---|---|---|---|
| 01-context | Why RLHF has a human-feedback bottleneck | 🟢 | 5 min |
| 02-the-problem | Specific failures of human labellers (inconsistency, bias, burnout) | 🟢 | 4 min |
| 03-the-idea | How constitutional critique and revision work; the intuition | 🟡 | 7 min |
| 04-the-math | Bradley-Terry reward model; critique prompt structure | 🟡 | 8 min |
| 05-worked-example | Step-by-step trace of CAI on a dangerous question | 🟡 | 7 min |
| 06-the-code | Python code showing the critique-revision loop | 🟡 | 6 min |
| 07-limitations | Constitution quality, AI critic bias, computational cost | 🟡 | 5 min |
| 08-impact | Claude 1–3, RLAIF across industry, AI governance | 🟢 | 5 min |
| 09-summary | One-sentence takeaway, what came next | 🟢 | 2 min |
Before You Read: Math Tutorials You Need
- Entropy and KL Divergence — why reward models use log-likelihood
- Bradley-Terry Model — preference modelling with paired comparisons
- Softmax and Cross-Entropy — why we use σ(r_B - r_A) for preference probability
ASCII Diagram
Old (RLHF):
Model output → Human reviewer (tired, biased) → Preference label
↓
1000 reviewers needed
Bottleneck!
New (Constitutional AI):
Model output → AI Critic (reads constitution) → Critique + revised output
↓
Use AI's feedback to train reward model
No human bottleneck
Flow of Constitutional AI:
1. Start: Model generates response
↓
2. SL-CAI: Ask AI "Does this violate Rule 1, 2, ..., N?"
↓
3. Revise: Model self-corrects based on critique
↓
4. Collect: Supervised training data (prompt → revised response)
↓
5. RL-CAI: Generate response pairs (A, B)
↓
6. Compare: AI judge "Which violates constitution less?" → preference
↓
7. Reward: Train reward model on AI preferences (Bradley-Terry)
↓
8. Optimize: Use PPO with constitutional reward model
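Steps 5 and 6 of the flow can be sketched as a function that turns an AI judge's answer into a preference label. This is a hedged illustration: `toy_judge`, `ai_preference_label`, the prompt wording, and the fixed log-probabilities are all hypothetical, and normalising over the two options to get a soft target is one reasonable choice rather than a confirmed detail of any particular implementation.

```python
import math
from typing import Callable, Dict, Tuple

def ai_preference_label(
    judge_logprobs: Callable[[str], Tuple[float, float]],
    prompt: str,
    response_a: str,
    response_b: str,
    principle: str,
) -> Dict[str, object]:
    """Ask the judge which response better follows the principle."""
    question = (
        f"Consider this conversation:\nHuman: {prompt}\n\n"
        f"{principle}\n(A) {response_a}\n(B) {response_b}\nThe answer is:"
    )
    logp_a, logp_b = judge_logprobs(question)
    # Normalise over the two options (numerically stable softmax).
    m = max(logp_a, logp_b)
    exp_a, exp_b = math.exp(logp_a - m), math.exp(logp_b - m)
    p_a = exp_a / (exp_a + exp_b)
    return {"prompt": prompt, "a": response_a, "b": response_b, "p_a_preferred": p_a}

def toy_judge(question: str) -> Tuple[float, float]:
    """Hypothetical judge returning fixed log-probs that favour option (B)."""
    return (math.log(0.2), math.log(0.6))

label = ai_preference_label(
    toy_judge,
    prompt="How do I pick a lock?",
    response_a="Sure, here is how to do it.",
    response_b="I can't help with that, but a locksmith can.",
    principle="Which response is less harmful?",
)
print(label["p_a_preferred"])
```

Records like `label` form the dataset for step 7, where the reward model is trained with the Bradley-Terry loss.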