Process Reward Model (PRM)

Appears in 2 papers

A machine-learning model trained to evaluate the quality of individual steps in a multi-step reasoning process.

As used in Paper 23: Scaling LLM Test-Time Compute Optimally Can be More Effective than Scaling Model Parameters

A machine-learning model trained to evaluate the quality of individual steps in a multi-step reasoning process. Unlike an Outcome Reward Model (ORM), which scores only the final answer, a PRM scores each step, so it can indicate which specific steps are correct and which are not. This gives finer-grained feedback for training and for selecting among candidate solutions.
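To make the step-level idea concrete, here is a minimal Python sketch, not the paper's implementation. It assumes a PRM can be treated as a function that takes the preceding steps and the current step and returns a score; `toy_prm`, `score_steps`, `first_bad_step`, `best_of_n`, the 0.5 threshold, and the use of `min` to aggregate step scores are all hypothetical choices made for illustration.

```python
from typing import Callable, List, Optional, Sequence

# Hypothetical interface: (previous steps, current step) -> score.
StepScorer = Callable[[Sequence[str], str], float]


def score_steps(prm: StepScorer, steps: Sequence[str]) -> List[float]:
    """Score every step, conditioning each one on the steps before it."""
    return [prm(steps[:i], steps[i]) for i in range(len(steps))]


def first_bad_step(step_scores: Sequence[float], threshold: float = 0.5) -> Optional[int]:
    """Return the index of the first step scored below the threshold, if any."""
    for i, score in enumerate(step_scores):
        if score < threshold:
            return i
    return None


def best_of_n(
    prm: StepScorer,
    candidates: Sequence[Sequence[str]],
    aggregate: Callable[[Sequence[float]], float] = min,
) -> Sequence[str]:
    """Pick the candidate solution whose aggregated step scores are highest.

    The aggregation rule (min here) is an assumption; other choices such as
    the last step's score or the product of scores are equally plausible.
    """
    return max(candidates, key=lambda steps: aggregate(score_steps(prm, steps)))


def toy_prm(previous_steps: Sequence[str], step: str) -> float:
    """Stand-in for a trained PRM: penalizes steps marked with '??'."""
    return 0.1 if "??" in step else 0.9


candidates = [
    ["2 + 3 = 5", "5 * 4 = 21 ??", "answer: 21"],
    ["2 + 3 = 5", "5 * 4 = 20", "answer: 20"],
]
chosen = best_of_n(toy_prm, candidates)
flawed_index = first_bad_step(score_steps(toy_prm, candidates[0]))
```

An ORM would see only the final line of each candidate. The step-level scores additionally locate the faulty step in the first candidate (`flawed_index`), which is the fine-grained signal described above.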

As used in Paper 24: rStar-Math: Small LLMs Can Master Math Reasoning with Self-Evolved Deep Thinking

A model trained to score the quality of individual reasoning steps rather than only the final answer. It scores each step on a scale of [0, 1], where a higher score indicates a more promising reasoning direction, and these scores are used to guide Monte Carlo Tree Search (MCTS).
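The sketch below illustrates how step scores can steer a search. It is a deliberately simplified greedy stand-in, not MCTS and not the rStar-Math algorithm: at each depth it keeps the single highest-scoring candidate step. The names `greedy_prm_search`, `toy_propose`, and `toy_prm`, and the scores in the table, are hypothetical.

```python
from typing import Callable, Dict, List, Sequence

# Hypothetical interfaces for illustration only.
Proposer = Callable[[Sequence[str]], List[str]]   # partial path -> candidate next steps
StepScorer = Callable[[Sequence[str], str], float]  # (partial path, step) -> score


def greedy_prm_search(propose: Proposer, prm: StepScorer, max_steps: int) -> List[str]:
    """Build a reasoning path by always taking the step the PRM scores highest."""
    path: List[str] = []
    for _ in range(max_steps):
        candidates = propose(path)
        if not candidates:  # nothing left to expand
            break
        best = max(candidates, key=lambda step: prm(path, step))
        path.append(best)
    return path


def toy_propose(path: Sequence[str]) -> List[str]:
    """Stand-in for a policy model proposing next steps at each depth."""
    options = [["x = 2", "x = 3"], ["y = x + 1", "y = x - 1"]]
    return options[len(path)] if len(path) < len(options) else []


TOY_SCORES: Dict[str, float] = {
    "x = 2": 0.8,
    "x = 3": 0.3,
    "y = x + 1": 0.9,
    "y = x - 1": 0.2,
}


def toy_prm(path: Sequence[str], step: str) -> float:
    """Stand-in for a trained PRM: looks scores up in a fixed table."""
    return TOY_SCORES[step]


result = greedy_prm_search(toy_propose, toy_prm, max_steps=5)
```

A tree search such as MCTS would use the same per-step scores differently, balancing exploration of several branches against exploitation of high-scoring ones rather than committing to one step at a time.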
