Reinforcement Learning from Human Feedback (RLHF)

Appears in 1 paper

As used in Paper 15: Training Language Models to Follow Instructions with Human Feedback

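A three-stage training pipeline for aligning language models with human intent: (1) supervised fine-tuning (SFT) of a pretrained model on human-written demonstrations, (2) training a reward model (RM) on human preference comparisons between model outputs, and (3) reinforcement learning with Proximal Policy Optimization (PPO) to optimize the fine-tuned policy against the reward model. This paper is a seminal demonstration of RLHF applied at scale to large language models.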
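
To make stages (2) and (3) concrete, the following is a minimal Python sketch of the two quantities they usually revolve around: a pairwise preference loss for the reward model, and a reward-model score with a KL-style penalty that discourages the policy from drifting away from the SFT model. It is a simplified illustration, not the paper's full objective. The function names, the scalar inputs, and the default `beta=0.1` are assumptions made for this example.

```python
import math


def pairwise_preference_loss(reward_chosen: float, reward_rejected: float) -> float:
    """Reward-model loss for one comparison: -log(sigmoid(r_chosen - r_rejected)).

    The loss is small when the human-preferred output scores higher than
    the rejected one, and large when the ordering is reversed.
    """
    margin = reward_chosen - reward_rejected
    # -log(sigmoid(m)) equals softplus(-m); this form avoids overflow in exp().
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))


def kl_penalized_reward(
    rm_score: float,
    logp_policy: float,
    logp_reference: float,
    beta: float = 0.1,  # hypothetical penalty strength, chosen for illustration
) -> float:
    """Reward passed to the RL step: RM score minus a log-ratio (KL-style) penalty.

    logp_policy and logp_reference are log-probabilities of the same sampled
    response under the current policy and the frozen SFT (reference) model.
    """
    return rm_score - beta * (logp_policy - logp_reference)


if __name__ == "__main__":
    # The preferred response scoring higher gives a lower loss than the reverse.
    print(pairwise_preference_loss(2.0, 0.0))
    print(pairwise_preference_loss(0.0, 2.0))
    # The penalty lowers the reward when the policy assigns the response
    # more probability than the reference model does.
    print(kl_penalized_reward(1.0, logp_policy=-1.0, logp_reference=-2.0, beta=0.5))
```

In practice the rewards and log-probabilities would come from neural networks and be averaged over batches of comparisons or sampled responses; scalars are used here only to keep the sketch self-contained.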