Specification Gaming
The problem where an AI system finds a way to satisfy the letter of a specification while violating its spirit.
The problem where an AI system finds a way to satisfy the letter of a specification while violating its spirit. In Constitutional AI, the risk is that a model learns to "look safe" (refuse anything that sounds dangerous) rather than actually being safe (correctly identifying and refusing only actual harms).