Off-Policy vs. On-Policy RL
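Off-policy: Learning from data generated by a policy other than the one being trained (e.g., supervised data written by humans or produced by other models).
On-policy: Learning from data generated by the current policy.
RLHF is mostly on-policy: during the RL stage the policy generates its own rollouts and is updated on them. It also has off-policy elements, such as SFT data that comes from other sources.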
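The distinction can be made concrete with a toy sketch. The example below assumes a two-action bandit with fixed rewards, which is far simpler than RLHF; all names (`collect`, `plain_average`, `importance_weighted_average`, the `target` and `behavior` policies) are hypothetical and chosen only for illustration. Importance weighting is used here as one standard way to correct for off-policy data; the notes above do not prescribe it.

```python
import random

def collect(policy, rewards, n, rng):
    """Generate n (action, reward) pairs by acting with `policy`.

    `policy` is a list of action probabilities; `rewards[a]` is the
    fixed reward for action a (a toy assumption).
    """
    actions = rng.choices(range(len(policy)), weights=policy, k=n)
    return [(a, rewards[a]) for a in actions]

def plain_average(data):
    """Mean reward of the data, with no correction.

    This estimates the value of whichever policy generated the data.
    """
    return sum(r for _, r in data) / len(data)

def importance_weighted_average(data, target, behavior):
    """Estimate the value of `target` from data generated by `behavior`.

    Each sample is reweighted by target[a] / behavior[a].
    """
    return sum(target[a] / behavior[a] * r for a, r in data) / len(data)

def true_value(policy, rewards):
    """Exact expected reward of `policy` in this toy bandit."""
    return sum(p * r for p, r in zip(policy, rewards))

if __name__ == "__main__":
    rng = random.Random(0)
    rewards = [1.0, 0.0]
    target = [0.8, 0.2]    # the policy being trained or evaluated
    behavior = [0.5, 0.5]  # some other policy that produced the data

    on_data = collect(target, rewards, 20000, rng)     # on-policy data
    off_data = collect(behavior, rewards, 20000, rng)  # off-policy data

    print("true value:           ", true_value(target, rewards))
    print("on-policy average:    ", plain_average(on_data))
    print("off-policy, naive:    ", plain_average(off_data))
    print("off-policy, corrected:",
          importance_weighted_average(off_data, target, behavior))
```

In this sketch the on-policy data can be averaged directly, while the off-policy data needs the reweighting step before it says anything about the target policy. The uncorrected average of the off-policy data reflects the behavior policy instead.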