KL Divergence Penalty
A regularization term in the RL objective that constrains the policy to stay close to the SFT baseline: β · KL[π_RL || π_SFT]. The term is subtracted from the reward, so the policy pays a cost that grows as it drifts from the SFT model. This keeps the RL policy from diverging too far in pursuit of reward, which limits catastrophic forgetting of pretraining knowledge and reduces reward hacking. β ≈ 0.01–0.1 is typical.
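A minimal sketch of how the penalty enters the reward, assuming small discrete distributions so the KL can be computed exactly. The names `kl_divergence`, `penalized_reward`, `pi_rl`, and `pi_sft` are hypothetical, and the value beta=0.05 is only an illustrative pick from the typical range; real RLHF code usually estimates the KL per token from sampled log-probabilities rather than summing over the full distribution.

```python
import math

def kl_divergence(p, q):
    """Exact KL[p || q] for two discrete distributions given as probability lists."""
    # Terms with p_i == 0 contribute nothing, so they are skipped.
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def penalized_reward(reward, policy_probs, sft_probs, beta=0.05):
    """Reward minus beta * KL[policy || SFT] (illustrative beta, not a recommendation)."""
    return reward - beta * kl_divergence(policy_probs, sft_probs)

# Toy next-token distributions over a three-token vocabulary (made-up numbers).
pi_sft = [0.5, 0.3, 0.2]
pi_rl = [0.7, 0.2, 0.1]

print(penalized_reward(1.0, pi_rl, pi_sft))   # slightly below 1.0: policy has drifted
print(penalized_reward(1.0, pi_sft, pi_sft))  # no drift, so no penalty
```

When the two distributions match, the KL is zero and the reward is unchanged; the further π_RL moves from π_SFT, the more reward is deducted, with β setting the exchange rate.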