Stop Summation: Min-Form Credit Assignment Is All Process Reward Model Needs for Reasoning

Jun-14-2026, 02:38:15 GMT–Neural Information Processing Systems

Process reward models (PRMs) have proven effective for test-time scaling of large language models (LLMs) on challenging reasoning tasks. However, the reward hacking that PRMs induce hinders their successful application in reinforcement fine-tuning. We find that the primary cause of this reward hacking is the canonical summation-form credit assignment in reinforcement learning (RL), which defines the value as the cumulative gamma-decayed future rewards and thereby leads the LLM to hack steps with high rewards. Therefore, to unleash the power of PRMs at training time, we propose PURE: Process sUpervised Reinforcement lEarning. The core of PURE is min-form credit assignment, which defines the value function as the minimum of future rewards.
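The contrast between the two credit-assignment forms can be illustrated with a minimal Python sketch. It is a direct reading of the abstract, not the paper's implementation: the function names, the example rewards, and the choice to include the current step's reward in the minimum are all assumptions made for illustration.

```python
from typing import List


def sum_form_values(rewards: List[float], gamma: float = 1.0) -> List[float]:
    """Summation-form credit: G_t = r_t + gamma * G_{t+1} (discounted return-to-go)."""
    values = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        values[t] = running
    return values


def min_form_values(rewards: List[float]) -> List[float]:
    """Min-form credit: V_t = min(r_t, r_{t+1}, ..., r_T).

    Assumption: the current step's reward is included in the minimum.
    """
    values = [0.0] * len(rewards)
    running = float("inf")
    for t in reversed(range(len(rewards))):
        running = min(running, rewards[t])
        values[t] = running
    return values


# Hypothetical per-step process rewards for a 4-step reasoning trace;
# the third step is the flawed one.
step_rewards = [1.0, 2.0, -1.0, 3.0]

sum_values = sum_form_values(step_rewards)
min_values = min_form_values(step_rewards)

print("sum-form:", sum_values)
print("min-form:", min_values)
```

In this toy trace the summation form still assigns large positive values to the early steps because high-reward steps outweigh the flawed one, whereas the min form caps every step before the flaw at the worst future reward. That difference is the intuition the abstract gives for why accumulating rewards invites hacking of high-reward steps.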

large language model, machine learning, reinforcement learning

Technology:
- Information Technology > Artificial Intelligence
  - Natural Language > Large Language Model (0.49)
  - Machine Learning > Reinforcement Learning (0.49)