Process Reward Model (PRM)
A machine-learning model trained to evaluate the quality of individual steps in a multi-step reasoning process. Unlike an Outcome Reward Model (ORM), which scores only the final answer, a PRM can identify which specific reasoning steps are correct and which are flawed, providing fine-grained feedback for training and for selecting among candidate solutions.
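The contrast between step-level and outcome-level scoring can be illustrated with a minimal sketch. Everything here is hypothetical: `toy_step_scorer` is a hand-written arithmetic checker standing in for a trained PRM, and `orm_score`, `first_bad_step`, and the 0.5 threshold are names and values chosen only for illustration.

```python
from typing import Callable, List, Optional


def toy_step_scorer(step: str) -> float:
    """Hypothetical stand-in for a trained PRM.

    Checks a step of the form 'a + b = c' and returns 1.0 if the
    arithmetic holds, else 0.0. A real PRM would be a learned model.
    """
    lhs, rhs = step.split("=")
    a, b = lhs.split("+")
    return 1.0 if int(a) + int(b) == int(rhs) else 0.0


def prm_scores(steps: List[str], scorer: Callable[[str], float]) -> List[float]:
    """PRM-style scoring: one score per reasoning step."""
    return [scorer(step) for step in steps]


def orm_score(steps: List[str], expected_final: int) -> float:
    """ORM-style scoring: looks only at the final answer."""
    final_answer = int(steps[-1].split("=")[1])
    return 1.0 if final_answer == expected_final else 0.0


def first_bad_step(scores: List[float], threshold: float = 0.5) -> Optional[int]:
    """Index of the first step scoring below the threshold, or None."""
    for index, score in enumerate(scores):
        if score < threshold:
            return index
    return None


# Task: compute 2 + 3 + 4 + 1 (true answer: 10). The second step is wrong.
solution = ["2 + 3 = 5", "5 + 4 = 10", "10 + 1 = 11"]

step_scores = prm_scores(solution, toy_step_scorer)
print(step_scores)                      # per-step feedback
print(orm_score(solution, 10))          # single outcome-level score
print(first_bad_step(step_scores))      # where the reasoning went wrong
```

The outcome-level score only reports that the final answer is wrong, while the per-step scores point to the step where the error entered.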
A model trained to score the quality of individual reasoning steps (as opposed to just the final answer). Scores each step in the range [0, 1], where higher scores indicate more promising reasoning directions. Used to guide Monte Carlo Tree Search (MCTS).
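As a rough sketch of how step scores in [0, 1] can steer a search, the snippet below picks the most promising next step among several candidates. This is a greedy selection, not MCTS: a full tree search would combine such scores with visit counts and backed-up values, which is omitted here. The scorer `toy_score`, its lookup table, and the function `pick_next_step` are hypothetical and exist only for illustration.

```python
from typing import Callable, List, Sequence, Tuple

# Signature assumed for illustration: (steps so far, candidate step) -> score.
StepScorer = Callable[[Sequence[str], str], float]


def clamp01(value: float) -> float:
    """Keep a score inside the [0, 1] range."""
    return max(0.0, min(1.0, value))


def pick_next_step(
    prefix: Sequence[str],
    candidates: List[str],
    score_step: StepScorer,
) -> Tuple[str, float]:
    """Return the candidate step with the highest score, and that score."""
    scored = [(clamp01(score_step(prefix, c)), c) for c in candidates]
    best_score, best_step = max(scored, key=lambda pair: pair[0])
    return best_step, best_score


# Hypothetical fixed scores standing in for a trained PRM's output.
TOY_SCORES = {
    "expand the square": 0.35,
    "factor the quadratic": 0.9,
    "guess x = 7": 0.1,
}


def toy_score(prefix: Sequence[str], step: str) -> float:
    """Look up a made-up score; unknown steps get 0.0."""
    return TOY_SCORES.get(step, 0.0)


prefix = ["solve x^2 - 5x + 6 = 0"]
step, score = pick_next_step(prefix, list(TOY_SCORES), toy_score)
print(step, score)
```

Repeating this selection step by step yields a greedy search; keeping several high-scoring candidates at each step instead of one would give a beam-style variant.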