Advantage / Advantage Estimation
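In policy gradient RL, the difference between the actual return and a baseline estimate of the expected return: A = reward - V(prompt). A positive advantage means the sampled output did better than expected; a negative advantage means it did worse. Weighting updates by the advantage rather than the raw reward typically reduces the variance of the gradient estimate, which tends to make learning more stable.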
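
As a rough illustration of the formula A = reward - V(prompt), the sketch below computes per-sample advantages from a list of rewards and a list of baseline values. The function name `compute_advantages` and the sample numbers are assumptions made for this example; in practice the baseline values would usually come from a learned value function rather than being hard-coded.

```python
def compute_advantages(rewards, values):
    """Return A_i = reward_i - V(prompt_i) for each sample.

    `rewards` holds the observed returns and `values` holds the baseline
    estimates V(prompt). Both names are hypothetical, chosen for this sketch.
    """
    if len(rewards) != len(values):
        raise ValueError("rewards and values must have the same length")
    return [r - v for r, v in zip(rewards, values)]


# Made-up numbers: three samples with the same baseline estimate of 0.5.
rewards = [1.0, 0.0, 0.5]
values = [0.5, 0.5, 0.5]
advantages = compute_advantages(rewards, values)

for r, v, a in zip(rewards, values, advantages):
    if a > 0:
        label = "better than expected"
    elif a < 0:
        label = "worse than expected"
    else:
        label = "as expected"
    print(f"reward={r:.1f} baseline={v:.1f} advantage={a:+.1f} ({label})")
```

The first sample gets a positive advantage, the second a negative one, and the third an advantage of zero, so only the first two would push the policy in either direction under this simplified view.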