Human Rater Agreement / Inter-Rater Reliability
Measure of how often different human raters agree on which output is better. In this paper, agreement is roughly 73%. Lower agreement indicates that the preferences are ambiguous; higher agreement indicates a clear preference signal. Some disagreement is expected, since raters differ in subjective taste.
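
As a rough illustration of how such an agreement rate could be computed, the sketch below pools every pair of raters who judged the same comparison and reports the fraction of pairs that picked the same output. This is an assumption about the metric, not the paper's stated procedure: the source does not say how its figure was derived. The function name `pairwise_agreement` and the example votes are hypothetical.

```python
from itertools import combinations


def pairwise_agreement(ratings):
    """Fraction of rater pairs, pooled over all items, that chose the same output.

    `ratings` is a list of items; each item is a list of the choices
    (e.g. "A" or "B") made by the raters who judged that comparison.
    Assumption: simple percent agreement, with no correction for chance.
    """
    agree = 0
    total = 0
    for item_votes in ratings:
        # Compare every unordered pair of raters on this item.
        for first, second in combinations(item_votes, 2):
            agree += int(first == second)
            total += 1
    if total == 0:
        raise ValueError("need at least one item with two or more ratings")
    return agree / total


# Hypothetical votes, made up for illustration only.
example = [
    ["A", "A", "A"],  # 3 pairs, 3 agree
    ["A", "B", "A"],  # 3 pairs, 1 agrees
    ["B", "B"],       # 1 pair, 1 agrees
    ["A", "B"],       # 1 pair, 0 agree
]
print(pairwise_agreement(example))  # 5 agreeing pairs out of 8 -> 0.625
```

Raw percent agreement of this kind does not account for agreement that would occur by chance; with two equally likely options, random raters would already agree about half the time.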