Position: The Hidden Costs and Measurement Gaps of Reinforcement Learning with Verifiable Rewards
Tu, Aaron; Xuan, Weihao; Qi, Heli; Huang, Xu; Zeng, Qingcheng; Talaei, Shayan; Xiao, Yijia; Xia, Peng; Tang, Xiangru; Zhuang, Yuchen; Hu, Bing; Cao, Hanqun; Shi, Wenqi; Leng, Tianang; Yang, Rui; Chen, Yingjian; Wang, Ziqi; Li, Irene; Liu, Nan; Yao, Huaxiu; Li, Li Erran; Liu, Ge; Saberi, Amin; Yokoya, Naoto; Leskovec, Jure; Choi, Yejin; Wu, Fang
arXiv.org Artificial Intelligence
Reinforcement learning with verifiable rewards (RLVR) is a practical and scalable approach to enhancing large language models in areas such as math, code, and other structured tasks. Two questions motivate this paper: how much of the reported gains survive under strictly parity-controlled evaluation, and whether RLVR is cost-free or exacts a measurable tax. We argue that progress is real, but gains are often overstated due to three forces: an RLVR tax, evaluation pitfalls, and data contamination. Using a partial-prompt contamination audit and matched-budget reproductions across base and RL models, we show that several headline gaps shrink or vanish under clean, parity-controlled evaluation. We then propose a tax-aware training and evaluation protocol that co-optimizes accuracy, grounding, and calibrated abstention and standardizes budgeting and provenance checks. Our position is constructive: RLVR is valuable and industry-ready; we advocate keeping its practical benefits while prioritizing reliability, safety, and measurement.

Reinforcement learning with verifiable rewards (RLVR) has become a leading post-training route for improving large language models on math, code, and other structured tasks (Luong et al., 2024; Wen et al., 2025a). By optimizing against automatically computable signals (unit tests for programs, exact numeric or string matches for math, or retrieval-grounded checks for citations), RLVR promises a scalable, label-efficient path to better reasoning. Recent results are striking: across multiple domains, RLVR systems often post large gains on standard benchmarks. Moreover, Figure 2 shows a rise in RLVR-tagged papers alongside improvements on AIME-24/25 from 2024 through H1 2025, underscoring both the field's momentum and the need to separate genuine reasoning gains from measurement and budgeting artifacts.
Parity-controlled studies show that base models can narrow or erase RLVR gaps when given matched sampling budgets, which is consistent with smarter search rather than capability expansion (Yue et al., 2025; Wu et al., 2025a).
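To make the matched-budget comparison concrete, here is a minimal sketch of how a base model and an RL-tuned model might be scored under the same sampling budget. It assumes the metric is pass@k computed with the standard unbiased estimator 1 - C(n-c, k)/C(n, k); the excerpt does not name the metric, so that choice is an assumption. The helper names and the per-problem counts are hypothetical toy inputs for illustration only, not results from the paper.

```python
from math import comb
from typing import Sequence


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate from n samples of which c are correct."""
    if not (0 <= c <= n) or not (1 <= k <= n):
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    # comb(n - c, k) is 0 when fewer than k samples are incorrect,
    # so the estimate is exactly 1.0 in that case.
    return 1.0 - comb(n - c, k) / comb(n, k)


def mean_pass_at_k(correct_counts: Sequence[int], n: int, k: int) -> float:
    """Average pass@k over problems, each sampled n times (hypothetical helper)."""
    return sum(pass_at_k(n, c, k) for c in correct_counts) / len(correct_counts)


# Toy, made-up counts of correct samples per problem (n = 8 samples each).
N_SAMPLES = 8
base_correct = [1, 0, 2, 1]  # hypothetical base model
rl_correct = [6, 0, 8, 0]    # hypothetical RL-tuned model

for k in (1, N_SAMPLES):
    base_score = mean_pass_at_k(base_correct, N_SAMPLES, k)
    rl_score = mean_pass_at_k(rl_correct, N_SAMPLES, k)
    print(f"pass@{k}: base={base_score:.4f} rl={rl_score:.4f}")
```

In this toy input the RL-tuned model leads at k = 1, while the base model covers more problems at k = 8. That is the kind of pattern a matched-budget comparison is meant to expose, and both models must be given the same n and k for the comparison to be meaningful.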
Sep-29-2025
- Country:
- Asia (0.67)
- North America > United States
- California (0.67)
- Genre:
- Research Report
- Strength High (0.34)
- Experimental Study (0.34)
- Research Report
- Industry:
- Information Technology > Hardware (0.34)
- Leisure & Entertainment > Games
- Computer Games (0.34)
- Technology: