Generalization / Generalizing to New Domains
Whether a trained reward model (or policy) performs well on new, unseen tasks or domains.
Whether a trained reward model (or policy) performs well on new, unseen tasks or domains. RLHF generalizes better than pure SFT because the RM learns general principles of preference (clarity, relevance, safety) that transfer across domains.