Limitations of Constitutional AI — Constitutional AI: Harmlessness from AI Feedback

Constitutional AI is powerful, but it has real limitations that are important to understand.

1. Constitution Quality: Garbage In, Garbage Out

The entire system depends on the quality of the written constitution. If your constitution is poorly written, vague, or biased, the AI will learn from a biased constitution.

Example: If your constitution says “Be helpful to the user,” that’s vague. Helpful how? To whom? The AI might interpret this as “do whatever the user asks,” which could be harmful. Or it might be overly cautious and refuse to help with legitimate requests.

Real case: Anthropic’s original constitution used natural language, which is inherently ambiguous. Different principles sometimes conflict (e.g., “be helpful” vs “avoid harm” — sometimes the helpful action causes harm).

Consequence: The AI inherits the biases and blindness of the people who wrote the constitution. If the constitution was written by a homogeneous group, it will reflect their values and blind spots.

2. AI Critic Bias and Consistency Issues

The AI doing the critique is not an impartial judge. It has its own biases, learned from its training data.

Problem: If the base model is trained on biased data, it will apply biases when critiquing its own outputs. The feedback signal becomes self-reinforcing — it learns to critique in ways that are consistent with its training data, not with objective principles.

Example: If the model was trained on data where certain groups are portrayed negatively, it might not recognize bias when it critiques. It might say “This response about group X is fine” when it actually contains harmful stereotypes.

Consequence: The constitution is only as unbiased as the model applying it. You need a very careful model and very careful principle-writing to avoid garbage-in-garbage-out at the critique stage.

3. Computational Cost: Many Forward Passes

Constitutional AI requires many forward passes through the model:

Generate response
Critique for principle 1
Revise for principle 1
Critique for principle 2
Revise for principle 2
… (repeat for ~16 principles)

For SL-CAI alone, each prompt requires ~30–50 forward passes. For RL-CAI, you generate pairs and critique them, which is another ~20 passes per pair.

Consequence: Training becomes 10–100x more expensive than standard supervised fine-tuning. For Anthropic, this meant running Constitutional AI on a large cluster for weeks.

Impact: This makes it hard for smaller labs to use the approach. You need significant compute budget.

4. Cannot Address Unforeseen Harms

The constitution is written by humans at a specific point in time. It cannot anticipate harms that emerge later.

Example: In 2022, when Constitutional AI was developed, the harms from deepfakes were not as well understood as they are in 2024. The constitution might not have a principle about “avoid enabling deepfakes” because it wasn’t salient.

Consequence: As new harms emerge (new types of scams, new manipulation tactics, new misuse cases), the constitution needs to be updated. But updating the constitution requires retraining the model.

5. Assumes Sufficient Base Model Capability

Constitutional AI works only if the base model is already capable of:

Understanding the constitution principles
Critiquing its own outputs
Generating revised versions
Reasoning about harm and helpfulness

If the base model is too small or undertrained, it cannot perform these tasks.

Example: Constitutional AI probably doesn’t work well on a 1B model. The model doesn’t have enough reasoning capability to understand “honesty” or “harm prevention” in a sophisticated way.

Consequence: You can’t use Constitutional AI to train very small models from scratch. You need a reasonably capable base model (at least several billion parameters, well-trained).

6. Difficulty Measuring Effectiveness

It’s hard to evaluate whether Constitutional AI actually made the model safer.

Challenge: How do you measure “harmlessness”? You need a benchmark of harmful prompts and ground truth labels for what is and isn’t safe. But those labels require human judgment — the exact bottleneck Constitutional AI was supposed to avoid.

Practice: Anthropic used red-teaming (hiring adversaries to find failures) and human evaluation to measure effectiveness. But this is expensive.

Consequence: You can’t easily compare Constitutional AI to RLHF without doing expensive human evaluation studies.

7. Potential for Specification Gaming

There’s a risk that the model learns to “game” the constitution rather than truly follow its principles.

Example: If the constitution says “avoid illegal content,” the model might learn to refuse any request that sounds illegal, even if it’s actually legal. The model is optimizing for “looks safe” rather than “is safe.”

Consequence: The model might become overly cautious and less helpful. Or it might learn surface-level tricks that don’t actually address the underlying principles.

8. Limited Transparency on Actual Criteria

While the constitution is written in natural language and is readable, the model’s actual learned criteria are not transparent.

Problem: You write a constitution saying “avoid harm,” but the reward model learns a complicated function that only approximately captures this. The actual decision boundary of the reward model is a complex neural network, which is a black box.

Consequence: You have transparency about your intent (the constitution), but not about what the model actually learned. There’s a gap between the stated principles and the actual behavior.

Summary

Constitutional AI is a powerful technique that solves the RLHF bottleneck, but it’s not a silver bullet. It depends critically on constitution quality, it’s computationally expensive, and it cannot guarantee perfect safety. It’s best viewed as a tool that scales human judgment (via the constitution) across AI training, rather than a replacement for human oversight entirely.