Stop Summation: Min-Form Credit Assignment Is All Process Reward Model Needs for Reasoning

Neural Information Processing Systems 

Process reward models (PRMs) have proven effective for test-time scaling of large language models (LLMs) on challenging reasoning tasks. However, the reward hacking that PRMs induce hinders their successful application in reinforcement fine-tuning. We find that the primary cause of this reward hacking is the canonical summation-form credit assignment in reinforcement learning (RL), which defines the value as the cumulative gamma-decayed future rewards and thereby leads the LLM to hack steps with high rewards. Therefore, to unleash the power of PRMs at training time, we propose PURE: Process sUpervised Reinforcement lEarning. The core of PURE is min-form credit assignment, which defines the value function as the minimum of future rewards.
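
To make the contrast between the two credit-assignment forms concrete, here is a minimal sketch that computes both from a list of per-step process rewards. It only illustrates the two definitions stated above and is not the PURE implementation. The function names, the example rewards, and the choice of gamma = 1 are assumptions made for illustration.

```python
from typing import List


def sum_form_returns(rewards: List[float], gamma: float = 1.0) -> List[float]:
    """Summation-form credit: cumulative gamma-decayed future rewards per step."""
    returns = [0.0] * len(rewards)
    running = 0.0
    # Walk backwards so each step reuses the return of the step after it.
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns


def min_form_returns(rewards: List[float]) -> List[float]:
    """Min-form credit: minimum reward over the current and all future steps."""
    returns = [0.0] * len(rewards)
    running = float("inf")
    # Walk backwards, tracking the smallest reward seen so far.
    for t in reversed(range(len(rewards))):
        running = min(running, rewards[t])
        returns[t] = running
    return returns


# Hypothetical per-step process rewards: three good steps and one bad step.
step_rewards = [0.9, 0.8, -0.5, 0.9]
sum_values = sum_form_returns(step_rewards, gamma=1.0)
min_values = min_form_returns(step_rewards)  # [-0.5, -0.5, -0.5, 0.9]
```

In this toy trajectory the summation form still assigns the first step a large positive value, because the high-reward steps outweigh the single bad one. The min form caps every step up to and including the bad step at that step's reward, so piling up high-reward steps cannot raise the value.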