Constitutional AI: Harmlessness from AI Feedback

What This Paper Did

RLHF (Paper 15) trains a reward model on thousands of human preference labels, but those labels are expensive and slow to collect, and judging harmful content is psychologically taxing for annotators. Constitutional AI (CAI) replaces human harmlessness labels with AI feedback: write a constitution (a list of 16–18 principles for how the AI should behave), use an AI to critique and revise model outputs against that constitution, and then use the AI’s preferences (not humans’) as the harmlessness signal for training the reward model. Human labels are still used for helpfulness. The approach has two stages:

SL-CAI (Supervised Learning): Sample a response to a harmful prompt, ask the model to critique it against a principle drawn at random from the constitution and then revise it, repeat this critique-revision step a few times, and fine-tune the model on the final self-corrected responses.
RL-CAI (Reinforcement Learning): Generate pairs of model outputs, ask the AI (not a human) to judge which one violates the constitution less, train a reward model on these AI-generated preferences, then use PPO to optimize the model against the constitutional reward model.
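
The SL-CAI stage can be sketched as a short loop. This is a minimal illustration, not the paper's implementation: `generate` stands in for any text-in, text-out model call, and `CONSTITUTION`, `critique_and_revise`, `stub_model`, and the prompt wording are all hypothetical names and strings chosen for this example.

```python
import random
from typing import Callable, Optional, Tuple

# Hypothetical two-principle constitution, for illustration only.
CONSTITUTION = [
    "Identify ways the response is harmful, unethical, or illegal.",
    "Identify ways the response encourages dangerous behaviour.",
]

def critique_and_revise(
    prompt: str,
    generate: Callable[[str], str],
    n_rounds: int = 2,
    rng: Optional[random.Random] = None,
) -> Tuple[str, str]:
    """Return (prompt, final_revision), i.e. one supervised training pair."""
    rng = rng or random.Random(0)
    response = generate(prompt)  # initial, possibly harmful, response
    for _ in range(n_rounds):
        principle = rng.choice(CONSTITUTION)  # one principle per round
        critique = generate(
            f"Prompt: {prompt}\nResponse: {response}\n"
            f"Critique request: {principle}"
        )
        response = generate(
            f"Prompt: {prompt}\nResponse: {response}\n"
            f"Critique: {critique}\n"
            "Revision request: Rewrite the response to address the critique."
        )
    return prompt, response

def stub_model(text: str) -> str:
    """Toy stand-in for a language model so the sketch runs offline."""
    if "Revision request:" in text:
        return "I can't help with that, but here is a safer alternative."
    if "Critique request:" in text:
        return "The response gives harmful instructions."
    return "Sure, here is how to do it..."

sft_pair = critique_and_revise("How do I pick a lock?", stub_model)
```

Only the final revision is kept as the training target; the intermediate critiques are scaffolding that the fine-tuned model never sees.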

Result: Claude models trained with Constitutional AI are both helpful and harmless, with harmlessness labels supplied by an AI judge rather than burned-out human reviewers. Because the feedback is generated by a model, the amount of preference data is limited by compute rather than by annotator time.

RLHF bottleneck:
  1000 harmful examples → 1000 human annotations → expensive, slow, psychologically taxing
  
Constitutional AI:
  1000 harmful examples → AI critiques all 1000 in parallel → SL-CAI + RLAIF → no human burnout
  
Key equations:
  - Reward model: r(x, y) is a scalar score for response y to prompt x
  - Preference probability: P(y_w ≻ y_l | x) = σ(r(x, y_w) - r(x, y_l))
  - Bradley-Terry loss: L = -E[log σ(r(x, y_w) - r(x, y_l))]
  - Same as RLHF but with AI-generated preferences instead of human labels
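
The Bradley-Terry loss above can be computed directly from reward scores. The following is a small sketch in plain Python, assuming each training pair has already been scored by the reward model; the function names `preference_probability` and `bradley_terry_loss` are illustrative, not taken from the paper.

```python
import math
from typing import Iterable, Tuple

def preference_probability(r_w: float, r_l: float) -> float:
    """P(y_w preferred over y_l) = sigmoid(r_w - r_l), computed stably."""
    z = r_w - r_l
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def bradley_terry_loss(pairs: Iterable[Tuple[float, float]]) -> float:
    """Mean of -log sigmoid(r_w - r_l) over (r_w, r_l) reward pairs."""
    total, count = 0.0, 0
    for r_w, r_l in pairs:
        z = r_w - r_l
        # -log sigmoid(z) = softplus(-z), written to avoid overflow
        total += max(-z, 0.0) + math.log1p(math.exp(-abs(z)))
        count += 1
    return total / count

# The loss is small when the preferred response gets the higher reward
# and large when the ranking is inverted.
good_ranking = bradley_terry_loss([(2.0, 0.0)])
bad_ranking = bradley_terry_loss([(0.0, 2.0)])
```

When both responses receive the same reward, the preference probability is 0.5 and the per-pair loss is log 2, which is the value an untrained reward model starts from.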

The Indian Analogy

Imagine you run a boarding school and need to enforce a code of conduct. The old way (RLHF): hire 1000 teachers to stand in hallways and say “No, that violates Rule 3” every time a student breaks a rule. The teachers get exhausted from constant conflict, and some start doubting themselves.

The new way (Constitutional AI): write down your code of conduct on a poster, then hire one very smart senior prefect who reads the rules every morning and spends their day asking younger students, “Did that action violate Rule 2? Why or why not? How would you rewrite the situation to follow the rules?” The senior prefect generates feedback automatically for every situation, following the written rules. The younger students learn by revising their behaviour based on the prefect’s logical critique — not arbitrary authority.

The constitution is transparent and auditable. Anyone can read the rules and see if the prefect is applying them fairly. The rules don’t change based on the mood of a human reviewer.

Comparison: RLHF vs. Constitutional AI

Aspect	RLHF (Paper 15)	Constitutional AI
Preference source	Thousands of human annotators	Single AI model (the model itself or a twin)
Scaling	Linear in human effort; bottleneck	Scales with compute; no human labelling bottleneck for harmlessness
Bias	Reflects human biases (culture, mood, disagreement)	Reflects AI training data biases; consistent application
Auditability	Implicit (hard to know why humans chose A over B)	Explicit (constitution is written and readable)
Speed	Slow (humans are slow)	Fast (AI is fast)
Psychological burden	Humans judge harmful content; burnout risk	No humans judge harmful content directly
Generality	Task-specific (need humans for each task)	Generalizes via constitution principles

Read in This Order

Section	What You Will Learn	Difficulty	Time
01-context	Why RLHF has a human-feedback bottleneck	🟢	5 min
02-the-problem	Specific failures of human labellers (inconsistency, bias, burnout)	🟢	4 min
03-the-idea	How constitutional critique and revision work; the intuition	🟡	7 min
04-the-math	Bradley-Terry reward model; critique prompt structure	🟡	8 min
05-worked-example	Step-by-step trace of CAI on a dangerous question	🟡	7 min
06-the-code	Python code showing the critique-revision loop	🟡	6 min
07-limitations	Constitution quality, AI critic bias, computational cost	🟡	5 min
08-impact	Claude 1–3, RLAIF across industry, AI governance	🟢	5 min
09-summary	One-sentence takeaway, what came next	🟢	2 min

Before You Read: Math Tutorials You Need

Entropy and KL Divergence — why reward models use log-likelihood
Bradley-Terry Model — preference modelling with paired comparisons
Softmax and Cross-Entropy — why we use σ(r_B - r_A) for preference probability

ASCII Diagram

Old (RLHF):
  Model output → Human reviewer (tired, biased) → Preference label
                      ↓
              1000 reviewers needed
              Bottleneck!

New (Constitutional AI):
  Model output → AI Critic (reads constitution) → Critique + revised output
                      ↓
              Use AI's feedback to train reward model
              No human bottleneck
              
Flow of Constitutional AI:
  
  1. Start: Model generates response
            ↓
  2. SL-CAI: Ask AI "Does this violate Rule k?" (k sampled from Rules 1..N)
            ↓
  3. Revise: Model self-corrects based on critique
            ↓
  4. Collect: Supervised training data (response → revised)
            ↓
  5. RL-CAI: Generate response pairs (A, B)
            ↓
  6. Compare: AI judge "Which violates constitution less?" → preference
            ↓
  7. Reward: Train reward model on AI preferences (Bradley-Terry)
            ↓
  8. Optimize: Use PPO with constitutional reward model
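
Step 6 has to turn the AI judge's answer into a training label for step 7. One way to do this, assuming the judge exposes log-probabilities for the two answer options, is to normalise them into a soft preference. The prompt wording and the names `build_comparison_prompt` and `soft_preference_label` are hypothetical.

```python
import math
from typing import Tuple

def build_comparison_prompt(question: str, a: str, b: str, principle: str) -> str:
    """Format a multiple-choice question for the AI judge (illustrative wording)."""
    return (
        f"Question: {question}\n"
        f"{principle}\n"
        f"(A) {a}\n"
        f"(B) {b}\n"
        "The answer is:"
    )

def soft_preference_label(logprob_a: float, logprob_b: float) -> Tuple[float, float]:
    """Normalise the judge's log-probabilities for '(A)' and '(B)' so they sum to 1."""
    m = max(logprob_a, logprob_b)  # subtract the max for numerical stability
    e_a = math.exp(logprob_a - m)
    e_b = math.exp(logprob_b - m)
    total = e_a + e_b
    return e_a / total, e_b / total

# Hypothetical judge output: 80% mass on "(A)", 10% on "(B)", rest elsewhere.
p_a, p_b = soft_preference_label(math.log(0.8), math.log(0.1))
```

Keeping the label soft rather than rounding to a hard A-or-B choice preserves how confident the judge was, which the reward model can then use as its target.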
