PPO (Proximal Policy Optimization)

Appears in 1 paper

A policy-gradient reinforcement learning algorithm, designed for stable updates, used in the RL stage.

As used in Paper 15: Training Language Models to Follow Instructions with Human Feedback

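The reinforcement learning algorithm used in the RL stage, where the language model is fine-tuned against a learned reward model. PPO updates the policy with a clipped surrogate objective: the probability ratio between the new and old policy is clipped to a small interval around 1, which removes the incentive for a single update to move the policy too far and so helps avoid training instability. PPO is simpler to implement than TRPO, which pursues a similar goal through a KL-constrained optimization step, and is generally regarded as more robust than earlier policy gradient methods such as A3C.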
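
To make the clipping concrete, here is a minimal sketch of the clipped surrogate loss in plain Python. It assumes per-sample log-probabilities under the old and new policy and precomputed advantage estimates are already available. The function name `ppo_clip_loss` and the default `clip_eps=0.2` are illustrative choices, not details taken from Paper 15.

```python
import math


def ppo_clip_loss(new_logps, old_logps, advantages, clip_eps=0.2):
    """Return the negative clipped surrogate objective, averaged over samples.

    Hypothetical helper for illustration only; clip_eps is an assumed default.
    """
    total = 0.0
    for new_lp, old_lp, adv in zip(new_logps, old_logps, advantages):
        # Probability ratio pi_new(a|s) / pi_old(a|s), computed from log-probs.
        ratio = math.exp(new_lp - old_lp)
        # Restrict the ratio to the interval [1 - clip_eps, 1 + clip_eps].
        clipped = max(1.0 - clip_eps, min(ratio, 1.0 + clip_eps))
        # Take the more pessimistic of the unclipped and clipped terms.
        total += min(ratio * adv, clipped * adv)
    # Negate so that minimizing the loss maximizes the objective.
    return -total / len(advantages)
```

With a positive advantage, the loss stops improving once the ratio exceeds `1 + clip_eps`, so the update gains nothing from pushing the policy further in that direction. With a negative advantage, the unclipped term is kept when the ratio is large, so moves that make a bad action more likely are still penalized in full.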