Advantage / Advantage Estimation

Appears in 1 paper

In policy gradient RL, the difference between actual return and baseline: A = reward - V(prompt).

As used in Paper 15: Training Language Models to Follow Instructions with Human Feedback

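In policy gradient RL, the advantage is the difference between the actual return and a baseline: A = reward - V(prompt), where V(prompt) is the value function's estimate of the expected reward for that prompt. A positive advantage means the response did better than expected, so the update raises its probability; a negative advantage means it did worse, so the update lowers it. Weighting updates by the advantage rather than the raw reward reduces the variance of the gradient estimate, and because the baseline does not depend on the sampled response, it does so without biasing the gradient in expectation.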
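To make the formula concrete, here is a minimal sketch in Python. It assumes one scalar reward per response and one baseline value per prompt; the function names `advantages` and `describe`, and the sample numbers, are hypothetical illustrations and do not come from the paper.

```python
from typing import List


def advantages(rewards: List[float], values: List[float]) -> List[float]:
    """Compute A = reward - V(prompt) for each (reward, baseline) pair.

    `rewards` holds the scalar reward for each sampled response and
    `values` holds the baseline estimate V(prompt) for the matching prompt.
    Both lists are illustrative inputs, not data from any real run.
    """
    if len(rewards) != len(values):
        raise ValueError("rewards and values must have the same length")
    return [r - v for r, v in zip(rewards, values)]


def describe(advantage: float) -> str:
    """Interpret the sign of an advantage."""
    if advantage > 0:
        return "better than expected"
    if advantage < 0:
        return "worse than expected"
    return "as expected"


if __name__ == "__main__":
    rewards = [1.0, 0.5, 2.0]   # made-up rewards for three responses
    values = [0.5, 0.5, 2.5]    # made-up baseline estimates V(prompt)
    for a in advantages(rewards, values):
        print(a, describe(a))
```

Note that the third response has the highest raw reward (2.0) but a negative advantage, because its prompt was expected to score even higher. This is the sense in which the advantage measures performance relative to expectation rather than absolute reward.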
