DisCO: Reinforcing Large Reasoning Models with Discriminative Constrained Optimization
Neural Information Processing Systems
The recent success and openness of DeepSeek-R1 have brought widespread attention to Group Relative Policy Optimization (GRPO) as a reinforcement learning method for large reasoning models (LRMs). In this work, we analyze the GRPO objective under a binary reward setting and reveal an inherent limitation: a question-level difficulty bias that arises from its group relative advantage function. We also identify a connection between GRPO and traditional discriminative methods in supervised learning.
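To make the difficulty bias concrete, the following is a minimal sketch, not the paper's derivation. It assumes the commonly used GRPO advantage, in which each sampled answer's reward is normalized by the mean and (population) standard deviation of its group. The function names, the `eps` stabilizer, and the example groups are illustrative assumptions. With binary rewards and a group success rate p, the summed absolute advantage works out to 2n·sqrt(p(1-p)), so questions that are very easy or very hard contribute less signal than questions near p = 0.5.

```python
import math


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one question's group: (r - mean) / (std + eps).

    Assumption: population standard deviation; `eps` is a hypothetical
    stabilizer that avoids division by zero when all rewards are equal.
    """
    n = len(rewards)
    mean = sum(rewards) / n
    std = math.sqrt(sum((r - mean) ** 2 for r in rewards) / n)
    return [(r - mean) / (std + eps) for r in rewards]


def advantage_mass(rewards):
    """Sum of absolute advantages, a rough proxy for per-question signal."""
    return sum(abs(a) for a in group_relative_advantages(rewards))


# Hypothetical groups of 8 sampled answers with binary (0/1) rewards.
groups = {
    "easy (p=7/8)": [1, 1, 1, 1, 1, 1, 1, 0],
    "medium (p=4/8)": [1, 1, 1, 1, 0, 0, 0, 0],
    "hard (p=1/8)": [1, 0, 0, 0, 0, 0, 0, 0],
    "solved (p=1)": [1, 1, 1, 1, 1, 1, 1, 1],
}

for name, rewards in groups.items():
    p = sum(rewards) / len(rewards)
    closed_form = 2 * len(rewards) * math.sqrt(p * (1 - p))
    print(f"{name}: mass={advantage_mass(rewards):.4f}, closed form={closed_form:.4f}")
```

Under these assumptions the medium-difficulty group carries the largest total advantage, the easy and hard groups carry equal but smaller amounts, and a fully solved group carries none.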
Jun-12-2026, 05:04:37 GMT