Specification Gaming

Appears in 1 paper

The problem where an AI system finds a way to satisfy the letter of a specification while violating its spirit.

As used in Paper 22 — Constitutional AI: Harmlessness from AI Feedback →

The problem where an AI system finds a way to satisfy the letter of a specification while violating its spirit. In Constitutional AI, the risk is that a model learns to "look safe" (refuse anything that sounds dangerous) rather than actually being safe (correctly identifying and refusing only actual harms).