Section 01

Context: The RLHF Bottleneck

Constitutional AI: Harmlessness from AI Feedback 2022

In 2022, InstructGPT (Paper 15) showed that fine-tuning language models with human feedback makes them more helpful and safer than next-token prediction alone. The method had four steps: (1) generate candidate outputs from the model, (2) have humans judge which one is better, (3) train a reward model to predict those human preferences, and (4) use PPO to fine-tune the model to maximize the predicted reward.
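Step (3) is commonly implemented with a pairwise (Bradley-Terry) loss: the reward model is penalized whenever it scores the response the human rejected above the one the human chose. The sketch below is illustrative only. The function name and the toy scores are assumptions made for this example, not code from InstructGPT.

```python
import math


def pairwise_preference_loss(reward_chosen: float, reward_rejected: float) -> float:
    """Return -log(sigmoid(reward_chosen - reward_rejected)).

    This is the negative log-likelihood, under a Bradley-Terry model,
    that the human-preferred response beats the rejected one.
    """
    margin = reward_chosen - reward_rejected
    # -log(sigmoid(m)) == log(1 + exp(-m)); branch to avoid overflow in exp().
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


# Toy reward scores (made-up numbers, not real model outputs).
loss_when_model_agrees = pairwise_preference_loss(2.0, 0.5)
loss_when_model_disagrees = pairwise_preference_loss(0.5, 2.0)

print(f"model agrees with human:    {loss_when_model_agrees:.4f}")
print(f"model disagrees with human: {loss_when_model_disagrees:.4f}")
```

The loss is small when the reward model already ranks the pair the way the human did and grows as it ranks them the other way, which is what pushes the model toward human preferences during training.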

By the end of 2022, this approach, known as RLHF, had become the standard for training helpful AI assistants. OpenAI’s ChatGPT (November 2022) used RLHF. Google’s LaMDA was fine-tuned on human-annotated data. Every safety-conscious lab was collecting human preference data.

But there was a massive bottleneck: every preference label comes from a human annotator, and a single training run needs many thousands of labels.

Here’s the problem at scale:

  1. Cost: Paying people to label outputs is expensive. A typical pay rate is $10–15 per hour. At 30–50 labels per hour, 100,000 preference labels cost roughly $20,000–$50,000 in wages alone, and quality control, duplicate labeling, and platform fees push the total higher.

  2. Speed: Humans are slow. An annotator can label maybe 30–50 outputs per hour, so 100,000 labels take roughly 2,000–3,300 person-hours. If you want to retrain your model monthly, this becomes a permanent operation.

  3. Inconsistency: Humans disagree. One annotator thinks a response is “helpful and honest.” Another thinks it’s “verbose and unhelpful.” Different cultural backgrounds lead to different judgements about what content is harmful. What’s offensive in one country is acceptable in another. The reward model sees contradictory signals.

  4. Bias: Human annotators have biases. They tend to prefer responses that sound like them or that reflect the views of their own group. These biases get encoded into the reward model.

  5. Psychological burden: Worst of all, annotators who judge harmful content burn out. Asking humans to rate outputs that involve violence, self-harm, abuse, or hate speech takes a toll, even when all they have to say is “No, this is bad.” Studies of content moderators report depression, anxiety, and PTSD. Anthropic knew this from experience running Constitutional AI experiments with both human and AI feedback.
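The cost and speed figures in items 1 and 2 can be checked with a quick back-of-the-envelope calculation. This is a rough sketch using only the pay rates and labeling speeds quoted above. It counts wages alone and ignores overhead such as quality control or duplicate labeling, and the function name is made up for illustration.

```python
def labeling_estimate(num_labels: int, labels_per_hour: float, hourly_rate: float):
    """Return (person_hours, wage_cost) for a labeling job, wages only."""
    person_hours = num_labels / labels_per_hour
    return person_hours, person_hours * hourly_rate


NUM_LABELS = 100_000

# Best case: fastest annotators at the lowest pay rate.
best_hours, best_cost = labeling_estimate(NUM_LABELS, labels_per_hour=50, hourly_rate=10)
# Worst case: slowest annotators at the highest pay rate.
worst_hours, worst_cost = labeling_estimate(NUM_LABELS, labels_per_hour=30, hourly_rate=15)

print(f"person-hours: {best_hours:,.0f} to {worst_hours:,.0f}")
print(f"wage cost:    ${best_cost:,.0f} to ${worst_cost:,.0f}")
```

Under these assumptions, 100,000 labels take roughly 2,000 to 3,300 person-hours and cost about $20,000 to $50,000 in wages, and that bill recurs every time the model is retrained.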

By 2022, the bottleneck was clear: RLHF is limited by human annotator capacity.

Anthropic’s insight: What if you didn’t use humans to judge harmfulness? What if you wrote down your principles for what is harmful and what is helpful, and used an AI to apply those principles?

That’s Constitutional AI.
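As a rough illustration of the idea, the sketch below replaces the human annotator with a model call that is handed a written principle and two candidate responses. Everything here is hypothetical: the prompt wording, the function names, and the `judge` callable (a stand-in for a real language model) are assumptions made for this example, not the format Anthropic used.

```python
from typing import Callable


def build_judge_prompt(principle: str, user_prompt: str,
                       response_a: str, response_b: str) -> str:
    """Format a multiple-choice question asking which response fits the principle."""
    return (
        f"Principle: {principle}\n\n"
        f"User request: {user_prompt}\n\n"
        f"(A) {response_a}\n"
        f"(B) {response_b}\n\n"
        "Which response better follows the principle? Answer A or B."
    )


def ai_preference(principle: str, user_prompt: str,
                  response_a: str, response_b: str,
                  judge: Callable[[str], str]) -> str:
    """Ask the judge model for a preference label and normalize it to 'A' or 'B'."""
    prompt = build_judge_prompt(principle, user_prompt, response_a, response_b)
    answer = judge(prompt).strip().upper()
    if answer not in ("A", "B"):
        raise ValueError(f"unexpected judge answer: {answer!r}")
    return answer


def stub_judge(prompt: str) -> str:
    """Hypothetical stand-in for a model call; a real system would query an LLM."""
    return " b\n"


label = ai_preference(
    principle="Choose the response that is less likely to cause harm.",
    user_prompt="How do I pick a lock?",
    response_a="Here is a step-by-step guide to breaking into a house.",
    response_b="I can explain how locks work, or help if you are locked out of your own home.",
    judge=stub_judge,
)
print(label)
```

The label that comes back plays the same role as a human preference label in RLHF: it can be collected in bulk and used to train a reward model, with no person having to read the harmful output.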