Overthinking Reduction with Decoupled Rewards and Curriculum Data Scheduling

Jiang, Shuyang, Liao, Yusheng, Zhang, Ya, Wang, Yanfeng, Wang, Yu

Oct-1-2025–arXiv.org Artificial Intelligence

While large reasoning models trained with critic-free reinforcement learning and verifiable rewards (RLVR) represent the state of the art, their practical utility is hampered by "overthinking", a critical issue where models generate excessively long reasoning paths without any performance benefit. Existing solutions that penalize length often fail, inducing performance degradation due to a fundamental misalignment between trajectory-level rewards and token-level optimization. We introduce CS, a framework built on our theoretical discovery of two previously unaddressed flaws in current length rewards: (1) the erroneous penalization of essential exploratory tokens and (2) the inadvertent rewarding of partial redundancy. Our framework's innovations include (i) a first-of-its-kind decoupled token-level reward mechanism that distinguishes and penalizes redundant tokens, and (ii) a novel curriculum batch scheduling strategy to balance efficiency and efficacy. CS reduces reasoning tokens by over 50% across seven benchmarks while maintaining or even improving performance. This demonstrates that substantial gains in reasoning efficiency can be achieved without compromising a model's underlying reasoning power.

Figure 1: Left: Two major flaws of prior practice, which applies a sequence-level length reward without control of the training data. CS improves pass@1 of base models while reducing token costs by 60% compared to the base model across 7 benchmarks. Experimental details are presented in Appendix E.5.

Recent large reasoning models (LRMs; Guo et al. (2025); OpenAI (2025); Qwen (2025)) trained with critic-free reinforcement learning (RL) algorithms, such as GRPO (Shao et al., 2024), DAPO (Yu et al., 2025), and REINFORCE++ (Hu et al., 2025a), have demonstrated impressive reasoning capabilities through verifiable outcome rewards.
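
As a rough illustration of the contrast the abstract draws between a trajectory-level length reward and a token-level reward that targets only redundant tokens, a minimal Python sketch follows. It is not the authors' implementation. The function names, the `length_coef` and `penalty` values, and the `redundant_mask` input are all hypothetical, and how redundant tokens are identified is not described in this text. The group normalization is written in the style of GRPO, with the population standard deviation chosen as an assumption.

```python
from statistics import mean, pstdev


def group_normalized_advantages(rewards, eps=1e-6):
    """GRPO-style advantage: standardize each trajectory reward within its group.

    Using the population standard deviation is an assumption of this sketch.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def sequence_level_token_rewards(correct, num_tokens, length_coef=0.001):
    """Trajectory-level length reward: one scalar shared by every token.

    Exploratory and redundant tokens receive exactly the same signal.
    `length_coef` is a hypothetical value.
    """
    scalar = (1.0 if correct else 0.0) - length_coef * num_tokens
    return [scalar] * num_tokens


def decoupled_token_rewards(correct, redundant_mask, penalty=1.0):
    """Token-level reward: outcome reward everywhere, penalty only where flagged.

    `redundant_mask` is a hypothetical input marking tokens judged redundant.
    """
    outcome = 1.0 if correct else 0.0
    return [outcome - penalty if flagged else outcome for flagged in redundant_mask]


# Hypothetical five-token response whose last two tokens are flagged redundant.
mask = [False, False, False, True, True]
seq_rewards = sequence_level_token_rewards(True, len(mask))
tok_rewards = decoupled_token_rewards(True, mask)
```

In this toy setup the sequence-level variant lowers the reward of all five tokens equally, while the token-level variant leaves the first three tokens untouched and penalizes only the flagged ones.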

large language model, machine learning, reinforcement learning

Country:
- Asia (0.28)
- Europe > Austria (0.28)

Genre:
- Research Report > New Finding (0.46)

Technology:
- Information Technology > Artificial Intelligence
  - Cognitive Science > Problem Solving (0.66)
  - Natural Language
    - Large Language Model (0.68)
    - Chatbot (0.49)
  - Machine Learning
    - Reinforcement Learning (0.75)
    - Neural Networks > Deep Learning (0.49)
