Value Function / Baseline
In RL, an estimate of expected future reward used to reduce gradient variance.
In RLHF, a learned function V(prompt) estimates the expected reward for a prompt, and the advantage of a sampled response is typically its reward minus V(prompt). Subtracting this baseline reduces noise in policy gradient estimates, which improves training stability.
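As a rough illustration of how a baseline is used, the sketch below subtracts a per-prompt value estimate from each observed reward to get an advantage. The function name `compute_advantages` and the reward and value numbers are hypothetical, chosen only for this example. A real RLHF setup would take V(prompt) from a trained value head, and may compute advantages per token rather than per response.

```python
from statistics import pvariance


def compute_advantages(rewards, values):
    """Return reward minus baseline for each sample (hypothetical helper)."""
    if len(rewards) != len(values):
        raise ValueError("rewards and values must have the same length")
    return [r - v for r, v in zip(rewards, values)]


# Made-up data: two responses for an "easy" prompt and two for a "hard" one.
# The values stand in for V(prompt), the baseline's predicted reward.
rewards = [2.0, 4.0, 9.0, 11.0]
values = [3.0, 3.0, 10.0, 10.0]

advantages = compute_advantages(rewards, values)

# The advantages vary far less than the raw rewards because the baseline
# absorbs the part of the reward that depends only on the prompt.
print("reward variance:   ", pvariance(rewards))
print("advantage variance:", pvariance(advantages))
```

In this toy data most of the spread in the raw rewards comes from which prompt was drawn, not from response quality. Removing that prompt-level component is what makes the advantage a less noisy weight for the policy gradient.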