DisCO: Reinforcing Large Reasoning Models with Discriminative Constrained Optimization

Jun-12-2026, 05:04:37 GMT–Neural Information Processing Systems

The recent success and openness of DeepSeek-R1 have brought widespread attention to Group Relative Policy Optimization (GRPO) as a reinforcement learning method for large reasoning models (LRMs). In this work, we analyze the GRPO objective under a binary reward setting and reveal an inherent limitation: a question-level difficulty bias that arises from its group relative advantage function. We also identify a connection between GRPO and traditional discriminative methods in supervised learning.
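To make the group relative advantage concrete, here is a minimal sketch. It assumes the commonly used GRPO normalization, where each sampled answer's reward is centered by the group mean and scaled by the group standard deviation. The function name `group_relative_advantages` and the `eps` stabilizer are illustrative choices, not taken from the paper, and the sketch does not reproduce the paper's own analysis.

```python
import math


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one question's group of sampled answers.

    Assumption: advantage_i = (r_i - mean(r)) / (std(r) + eps), the usual
    GRPO-style normalization. `eps` is a hypothetical stabilizer that avoids
    division by zero when every answer in the group gets the same reward.
    """
    n = len(rewards)
    mean = sum(rewards) / n
    # Population variance over the group (not the sample variance).
    variance = sum((r - mean) ** 2 for r in rewards) / n
    std = math.sqrt(variance)
    return [(r - mean) / (std + eps) for r in rewards]


# Binary rewards: 1 = correct answer, 0 = incorrect answer.
hard_question = [1, 0, 0, 0]   # 25% of sampled answers are correct
easy_question = [1, 1, 1, 0]   # 75% of sampled answers are correct
solved_question = [1, 1, 1, 1]  # every sampled answer is correct

for name, group in [("hard", hard_question),
                    ("easy", easy_question),
                    ("solved", solved_question)]:
    print(name, [round(a, 3) for a in group_relative_advantages(group)])
```

With binary rewards, the normalized advantage depends only on the fraction of correct answers in the group: the single correct answer to the hard question receives about 1.732, the correct answers to the easy question about 0.577 each, and a group whose answers are all correct (or all incorrect) receives zero advantage everywhere. This dependence on per-question accuracy is the kind of effect the abstract's difficulty-bias analysis is concerned with.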

artificial intelligence, machine learning, reinforcement learning

Technology:
- Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (0.58)