Reward Model (RM)

Appears in 2 papers

A neural network trained in the second stage of RLHF to predict which of two responses humans prefer.

As used in Paper 15 — Training Language Models to Follow Instructions with Human Feedback →

A neural network trained in the second stage of RLHF to predict which of two responses humans prefer. It takes a (prompt, response) pair and outputs a scalar reward (a logit). It is trained on human preference comparisons with a Bradley-Terry loss, which pushes the score of the preferred response above the score of the rejected one. Once trained, it provides fast reward estimates during RL without human raters in the loop.
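The Bradley-Terry objective can be illustrated with a minimal sketch. It assumes the reward model has already produced one scalar score for each response in a comparison; the function names `preference_probability` and `bradley_terry_loss` are hypothetical and are not taken from the paper.

```python
import math


def preference_probability(reward_chosen: float, reward_rejected: float) -> float:
    """Bradley-Terry probability that the chosen response beats the rejected one.

    This is the sigmoid of the reward difference.
    """
    margin = reward_chosen - reward_rejected
    if margin >= 0:
        return 1.0 / (1.0 + math.exp(-margin))
    # Equivalent form that avoids overflow for large negative margins.
    exp_margin = math.exp(margin)
    return exp_margin / (1.0 + exp_margin)


def bradley_terry_loss(reward_chosen: float, reward_rejected: float) -> float:
    """Negative log-likelihood of the human-preferred ordering.

    Equals -log(sigmoid(reward_chosen - reward_rejected)), computed stably.
    """
    margin = reward_chosen - reward_rejected
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


# Illustrative scores only (not real model outputs).
loss_when_correct = bradley_terry_loss(2.0, 0.0)  # preferred response scored higher
loss_when_wrong = bradley_terry_loss(0.0, 2.0)    # preferred response scored lower
```

The loss is small when the preferred response already scores higher and grows roughly linearly as the ordering becomes more wrong, so minimizing it over many comparisons teaches the model to rank responses the way the raters did.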

As used in Paper 22 — Constitutional AI: Harmlessness from AI Feedback →

A neural network trained to predict how "good" an AI output is. In RLAIF, the reward model learns from AI-generated preferences: an AI model judges which of two responses better follows the constitution. The trained reward model then supplies the reward signal for PPO optimization.
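One way to train on AI-generated preferences is sketched below. It assumes, as an illustration not stated in this entry, that the AI labeler returns a probability that response A follows the constitution better than response B, and that this probability is used as a soft target. The names `softplus` and `soft_preference_loss` are hypothetical.

```python
import math


def softplus(x: float) -> float:
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def soft_preference_loss(reward_a: float, reward_b: float, p_a_preferred: float) -> float:
    """Cross-entropy between an AI-provided soft label and the reward model's prediction.

    p_a_preferred is the (assumed) probability from the AI labeler that
    response A follows the constitution better than response B.
    """
    margin = reward_a - reward_b
    log_p_a = -softplus(-margin)  # log sigmoid(margin)
    log_p_b = -softplus(margin)   # log (1 - sigmoid(margin))
    return -(p_a_preferred * log_p_a + (1.0 - p_a_preferred) * log_p_b)


# Illustrative values only: the labeler is 90% confident that A is better.
loss_agree = soft_preference_loss(1.0, 0.0, 0.9)     # reward model also favors A
loss_disagree = soft_preference_loss(0.0, 1.0, 0.9)  # reward model favors B
```

With a hard label (`p_a_preferred` equal to 1.0) this reduces to the usual pairwise Bradley-Terry loss, so the same reward model architecture can be trained on human or AI comparisons.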