Paper 15

Further Reading: RLHF and InstructGPT

Dive deeper into alignment, RLHF variants, and the products built on this paper.


The Original Paper

Training Language Models to Follow Instructions with Human Feedback
Long Ouyang, Jeff Wu, Xu Jiang, Diogo Almeida, Carroll L. Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, John Schulman, Jacob Hilton, Fraser Kelton, Luke Miller, Maddie Simens, Amanda Askell, Peter Welinder, Paul Christiano, Jan Leike, Ryan Lowe
NeurIPS 2022 | March 2022

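The foundational paper. Lays out the three-stage RLHF pipeline (supervised fine-tuning, reward model training, RL with PPO), shows that human labelers prefer outputs from a 1.3B-parameter InstructGPT model over those of the 175B-parameter GPT-3, and introduces the InstructGPT models. Essential reading for understanding modern aligned LLMs.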
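As a rough illustration of the third stage, the RL step optimizes the reward model's score minus a penalty for drifting away from the supervised fine-tuned (SFT) model. The sketch below is a minimal version under simplifying assumptions: it works on whole-sequence log-probabilities rather than per-token values, and the function name `kl_penalized_reward` and the default `beta` are illustrative choices, not taken from the paper's code.

```python
def kl_penalized_reward(rm_score: float,
                        policy_logp: float,
                        sft_logp: float,
                        beta: float = 0.02) -> float:
    """Reward-model score minus a KL-style penalty (illustrative sketch).

    rm_score    -- scalar score from the learned reward model
    policy_logp -- log-probability of the response under the RL policy
    sft_logp    -- log-probability of the same response under the SFT model
    beta        -- penalty strength (hypothetical default)
    """
    # The log-ratio is positive when the policy assigns the response more
    # probability than the SFT model did, so drift away from SFT is penalized.
    log_ratio = policy_logp - sft_logp
    return rm_score - beta * log_ratio


if __name__ == "__main__":
    # Same log-probability under both models: no penalty is applied.
    print(kl_penalized_reward(1.0, policy_logp=-5.0, sft_logp=-5.0))
    # The policy has drifted toward this response: the reward is reduced.
    print(kl_penalized_reward(1.0, policy_logp=-3.0, sft_logp=-5.0))
```

The penalty discourages the policy from exploiting weaknesses in the reward model by moving far from text the SFT model would produce.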


Foundational Work on Preference Learning

Learning from Human Preferences: The Original Idea

Deep Reinforcement Learning from Human Preferences
Paul F. Christiano, Jan Leike, Tom B. Brown, Miljan Martic, Shane Legg, Dario Amodei
NeurIPS 2017 | June 2017

An early demonstration of deep RL trained from human feedback. Predates this paper by 5 years but uses the same core insight: humans provide preference comparisons, a reward model is fit to them, and RL optimizes against that learned reward. Much smaller scale (Atari games and simulated robot locomotion), but the conceptual foundation.

Why read it: Understand the original vision and see how the idea scaled from games to language models.
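The preference-comparison idea can be sketched with a Bradley-Terry style model: the probability that a human prefers option A over option B is a sigmoid of the difference in their predicted rewards, and the reward model is trained with a cross-entropy loss on the human's choice. This is a minimal sketch with hypothetical function names and scalar rewards standing in for a neural network's outputs.

```python
import math


def preference_probability(reward_a: float, reward_b: float) -> float:
    """Bradley-Terry probability that option A is preferred over option B."""
    # Equivalent to exp(r_a) / (exp(r_a) + exp(r_b)), written as a sigmoid.
    return 1.0 / (1.0 + math.exp(-(reward_a - reward_b)))


def preference_loss(reward_a: float, reward_b: float,
                    human_prefers_a: bool) -> float:
    """Cross-entropy loss for a single human comparison."""
    p_a = preference_probability(reward_a, reward_b)
    p_chosen = p_a if human_prefers_a else 1.0 - p_a
    return -math.log(p_chosen)


if __name__ == "__main__":
    # The loss is lower when the predicted rewards agree with the human label.
    print(preference_loss(2.0, 0.0, human_prefers_a=True))
    print(preference_loss(0.0, 2.0, human_prefers_a=True))
```

Minimizing this loss over many comparisons pushes the reward model to score preferred behavior higher, which is the signal RL then optimizes.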


Fine-Tuning Language Models from Human Preferences

Fine-Tuning Language Models from Human Preferences
Daniel M. Ziegler, Nisan Stiennon, Jeffrey Wu, Tom B. Brown, Alec Radford, Dario Amodei, Paul Christiano, Geoffrey Irving
arXiv 2019 | September 2019

Earlier application of preference learning to language models (GPT-2). Smaller scale but demonstrates the concept works for text. This paper (InstructGPT) scales it dramatically.

Why read it: See the precursor; understand how the technique evolved.


Key Follow-Ups: Improving RLHF

Constitutional AI: AI Feedback Instead of Human Feedback

Constitutional AI: Harmlessness from AI Feedback
Yuntao Bai, Saurav Kadavath, Sandipan Kundu, Amanda Askell, Jackson Kernion, Andy Jones, et al.
Anthropic | December 2022

Key innovation: Instead of humans labeling harmful outputs, a model critiques and revises its own responses against a written set of principles (the constitution), and an AI model then supplies the preference labels used for RL.

Why relevant: Addresses the scalability problem of RLHF (human annotation is expensive). CAI replaces human harmlessness labels with AI feedback, and the approach is used in training Claude.

Results: The paper reports assistants that are harmless without being evasive, while helpfulness labels still come from humans.


Direct Preference Optimization (DPO)

Direct Preference Optimization: Your Language Model is Secretly a Reward Model
Rafael Rafailov, Archit Sharma, Eric Mitchell, Stefano Ermon, Christopher D. Manning, Chelsea Finn
Stanford | May 2023

Key innovation: Eliminate the separate reward model training stage. Train the policy directly on preference pairs using a closed-form objective.

Advantages:

  • Simpler (2 stages instead of 3)
  • More stable (fewer hyperparameters)
  • Matches or exceeds RLHF performance
  • Faster training

Why relevant: DPO removes the RL loop while achieving comparable results in the paper's experiments. Many recent open models use DPO in place of PPO-based RLHF.
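The closed-form objective can be sketched for a single preference pair. The loss is the negative log-sigmoid of a scaled margin: how much more the policy has raised the chosen response's log-probability relative to a frozen reference model than it has raised the rejected one's. This is a minimal sketch on scalar sequence log-probabilities; the function names and the default `beta` are illustrative assumptions.

```python
import math


def log_sigmoid(z: float) -> float:
    """Numerically stable log(sigmoid(z))."""
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))


def dpo_loss(policy_logp_chosen: float, policy_logp_rejected: float,
             ref_logp_chosen: float, ref_logp_rejected: float,
             beta: float = 0.1) -> float:
    """DPO loss for one (chosen, rejected) pair of responses to a prompt."""
    # Log-ratios of policy to reference act as implicit rewards.
    chosen_logratio = policy_logp_chosen - ref_logp_chosen
    rejected_logratio = policy_logp_rejected - ref_logp_rejected
    margin = beta * (chosen_logratio - rejected_logratio)
    return -log_sigmoid(margin)


if __name__ == "__main__":
    # Policy identical to the reference: margin is zero, loss is log(2).
    print(dpo_loss(-10.0, -12.0, -10.0, -12.0))
    # Policy raised the chosen response and lowered the rejected one.
    print(dpo_loss(-8.0, -14.0, -10.0, -12.0))
```

No reward model is trained or sampled from here; the preference data updates the policy directly, which is where the simplification comes from.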


ORPO: Odds Ratio Preference Optimization

ORPO: Monolithic Preference Optimization without Reference Model
Jiwoo Hong, Noah Lee, James Thorne
arXiv | March 2024

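Key innovation: Fold preference optimization into supervised fine-tuning by adding an odds-ratio penalty to the SFT loss, which removes the need for a separate reference model.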

Why relevant: Even simpler than DPO, showing the direction of optimization: from complex pipelines (RLHF) to streamlined end-to-end methods.
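The odds-ratio idea can be sketched for a single preference pair: the loss is the usual negative log-likelihood on the chosen response plus a weighted term that pushes the odds of the chosen response above the odds of the rejected one. This is a minimal sketch assuming length-averaged log-probabilities as inputs; the function names and the weight `lam` are illustrative, not the authors' code.

```python
import math


def log_sigmoid(z: float) -> float:
    """Numerically stable log(sigmoid(z))."""
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))


def log_odds(avg_logp: float) -> float:
    """log(p / (1 - p)) for p = exp(avg_logp); requires avg_logp < 0."""
    return avg_logp - math.log1p(-math.exp(avg_logp))


def orpo_loss(avg_logp_chosen: float, avg_logp_rejected: float,
              lam: float = 0.1) -> float:
    """SFT loss on the chosen response plus a weighted odds-ratio penalty."""
    sft_term = -avg_logp_chosen  # negative log-likelihood of the chosen response
    log_odds_ratio = log_odds(avg_logp_chosen) - log_odds(avg_logp_rejected)
    or_term = -log_sigmoid(log_odds_ratio)
    return sft_term + lam * or_term


if __name__ == "__main__":
    # Lower loss when the chosen response is the more likely one.
    print(orpo_loss(-0.5, -2.0))
    print(orpo_loss(-2.0, -0.5))
```

Only the model being trained appears in the loss, so there is no frozen reference copy to keep in memory.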


On the Measurement and Control of Bias in Language Generation

On the Measurement and Mitigation of Unintended Bias in Text Generation
Su Lin Blodgett, Solon Barocas, Hal Daumé III, Suresh Venkatasubramanian
AIES 2020

Addresses bias in language models and measurement challenges in alignment.


Learning to Summarize with Human Feedback

Learning to summarize from human feedback
Nisan Stiennon, Long Ouyang, Jeff Wu, Daniel M. Ziegler, Ryan Lowe, Chelsea Voss, Alec Radford, Dario Amodei, Paul Christiano
OpenAI, NeurIPS 2020

Earlier OpenAI work applying preference learning to summarization (before InstructGPT). Shows the technique works for specific tasks.


Products and Deployments

ChatGPT: Bringing InstructGPT to Millions

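ChatGPT (launched November 2022, about nine months after this paper) is described by OpenAI as a sibling model to InstructGPT, trained with the same RLHF methods and a somewhat different data collection setup.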

Why relevant: See how the paper’s techniques became the world’s most popular AI product.


Claude: Anthropic’s RLHF Alternative

Claude is trained with Constitutional AI, a variant of RLHF that replaces part of the human feedback with AI feedback (RLAIF) to reduce annotation cost.

Why relevant: See how Constitutional AI improves on RLHF’s data cost problem.


GPT-4 with Improved RLHF

GPT-4 Technical Report
OpenAI | March 2023

Describes GPT-4's capabilities and safety work, including RLHF post-training supplemented by rule-based reward models. Shares few training details, but shows iteration and refinement of the InstructGPT approach.


LLaMA-2-Chat: Open-Source RLHF

Llama 2: Open Foundation and Fine-Tuned Chat Models
Meta | July 2023

Demonstrates RLHF applied to openly released models. Includes details on preference data collection, separate helpfulness and safety reward models, and the alignment procedure.

Why relevant: Shows RLHF is a general technique, not specific to OpenAI models.


Deeper Dives: Theory and Challenges

Reward Model Uncertainty and Distributional Shift

Reward Modeling for Faster Actual-Outcome Prediction in Reinforcement Learning
Daniel Dewey, et al.

Explores theoretical properties of reward models and distributional shift — a key challenge mentioned in this paper’s limitations.


Mechanistic Interpretability of Alignment

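Interpretability in the Wild: a Circuit for Indirect Object Identification in GPT-2 Small
Kevin Wang, Alexandre Variengien, Arthur Conmy, Buck Shlegeris, Jacob Steinhardt
ICLR 2023

Reverse-engineers a circuit that implements a specific behavior in GPT-2 small. A worked example of the methods needed to study how trained-in behaviors, including alignment objectives, are encoded in neural networks.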


Scaling Alignment

Beyond Preference Learning: Debate and Recursive Oversight

Scalable oversight of AI systems by humans using generative models
Paul Christiano, et al.

Explores how to extend preference learning to more complex forms of human feedback (debate, recursive oversight). Relevant for aligning more capable models.


Benchmarks and Evaluation

Towards Human-Level Performance on Automatic GLUE Score Prediction

Human-Level Performance in Large Language Models on Instruction-Following Tasks

Measures instruction-following quality (what InstructGPT improved).


TruthfulQA: Measuring Factuality in QA

TruthfulQA: Measuring How Models Mimic Human Falsehoods
Stephanie Lin, Jacob Hilton, Owain Evans
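ACL 2022 | September 2021

Benchmark for measuring truthfulness, one dimension of alignment: 817 questions across 38 categories, written so that imitating common human misconceptions produces false answers. The InstructGPT paper reports results on it.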


Implementation and Tools

Hugging Face TRL: Transformer Reinforcement Learning

GitHub: huggingface/trl

Widely used library with trainers for supervised fine-tuning, reward modeling, PPO, DPO, and related methods. Handles much of the engineering complexity.

Why use it: If you’re implementing RLHF, TRL does the heavy lifting.


DeepSpeed-Chat: Scalable RLHF

GitHub: microsoft/DeepSpeedExamples (applications/DeepSpeed-Chat)

Microsoft’s framework for distributed RLHF training. Handles multi-GPU/multi-node scaling.


TensorFlow RL Suite

Alternatives to PyTorch for RL implementation.


Safety and Alignment Research

Center for AI Safety

CAIS Alignment Research

Ongoing research on scalable oversight, value learning, and alignment techniques.


Anthropic Research

Anthropic Blog

Extensive research on Constitutional AI, interpretability, and alignment scaling.


Open Questions and Future Directions

What Remains Unsolved

  1. Scalable oversight: How do we stay in control of superhuman models?
  2. Value learning: Can models learn complex human values beyond preferences?
  3. Adversarial robustness: Can aligned models be tricked into misalignment?
  4. Multi-objective alignment: How do we balance safety with capability?

Papers on These Questions

The Alignment Problem: Machine Learning and Human Values
Brian Christian | 2020 | Book

Comprehensive overview of alignment challenges and solutions.


AI Safety and Reproducibility: Case Studies and Suggestions
Liane Lovitt, et al.

Recent work on reproducibility in alignment research.


Quick Reference: RLHF Evolution (2017–2025)

2017 Jun:  Deep Reinforcement Learning from Human Preferences (Christiano et al.)

2019 Sep:  Fine-Tuning Language Models from Human Preferences (Ziegler et al.)

2020 Sep:  Learning to Summarize from Human Feedback (Stiennon et al.)

2022 Mar:  InstructGPT / RLHF (this paper) ← You are here

2022 Nov:  ChatGPT launches

2022 Dec:  Constitutional AI (Bai et al.)

2023 Mar:  GPT-4 Technical Report (refined RLHF post-training)

2023 May:  DPO (Direct Preference Optimization) - simpler alternative

2024 Mar:  ORPO - even simpler

2025+:     Continued refinement and new approaches

Key Papers to Read in Order

  1. This paper: InstructGPT — Foundation
  2. Constitutional AI — Scalable feedback (RLAIF)
  3. DPO — Simpler pipeline
  4. ORPO — Further simplification
  5. ChatGPT Blog Post — Product deployment

Then, depending on interest:

  • Alignment: Read CAIS and Anthropic papers on scalable oversight
  • Safety: Read papers on adversarial robustness and value learning
  • Implementation: Work through HuggingFace TRL tutorials

