Problem Solving
Overthinking Reduction with Decoupled Rewards and Curriculum Data Scheduling
Jiang, Shuyang, Liao, Yusheng, Zhang, Ya, Wang, Yanfeng, Wang, Yu
While large reasoning models trained with critic-free reinforcement learning and verifiable rewards (RL VR) represent the state-of-the-art, their practical utility is hampered by "overthinking", a critical issue where models generate excessively long reasoning paths without any performance benefit. Existing solutions that penalize length often fail, inducing performance degradation due to a fundamental misalignment between trajectory-level rewards and token-level optimization. CS, built on our theoretical discovery of two previously unaddressed flaws in current length rewards: (1) the erroneous penalization of essential exploratory tokens and (2) the inadvertent rewarding of partial redundancy. Our framework's innovations include (i) a first-of-its-kind decoupled token-level reward mechanism that surgically distinguishes and penalizes redundant tokens, and (ii) a novel curriculum batch scheduling strategy to master the efficiency-efficacy equilibrium. CS can achieve a dramatic reduction in reasoning tokens by over 50% across seven benchmarks while simultaneously maintaining or even improving performance. It demonstrates conclusively that substantial gains in reasoning efficiency can be achieved without compromising a model's underlying reasoning power.Figure 1: Left: Two major flaws of prior practice apply sequence-level length reward without control of training data. CS improves pass@1 of base models while reducing 60% token costs compared to the base model across 7 benchmarks. Experimental details are presented in Appendix E.5. Recent large reasoning models (LRM; Guo et al. (2025); OpenAI (2025); Qwen (2025)) trained with critic-free reinforcement learning (RL) algorithms, such as GRPO (Shao et al., 2024), DAPO (Y u et al., 2025), and REINFORCE++ (Hu et al., 2025a), have demonstrated impressive reasoning capabilities through verifiable outcome rewards.
Atomic Thinking of LLMs: Decoupling and Exploring Mathematical Reasoning Abilities
Kuang, Jiayi, Huang, Haojing, Li, Yinghui, Liang, Xinnian, Xu, Zhikun, Li, Yangning, Tan, Xiaoyu, Qu, Chao, Zhang, Meishan, Shen, Ying, Yu, Philip S.
Large Language Models (LLMs) have demonstrated outstanding performance in mathematical reasoning capabilities. However, we argue that current large-scale reasoning models primarily rely on scaling up training datasets with diverse mathematical problems and long thinking chains, which raises questions about whether LLMs genuinely acquire mathematical concepts and reasoning principles or merely remember the training data. In contrast, humans tend to break down complex problems into multiple fundamental atomic capabilities. Inspired by this, we propose a new paradigm for evaluating mathematical atomic capabilities. Our work categorizes atomic abilities into two dimensions: (1) field-specific abilities across four major mathematical fields, algebra, geometry, analysis, and topology, and (2) logical abilities at different levels, including conceptual understanding, forward multi-step reasoning with formal math language, and counterexample-driven backward reasoning. We propose corresponding training and evaluation datasets for each atomic capability unit, and conduct extensive experiments about how different atomic capabilities influence others, to explore the strategies to elicit the required specific atomic capability. Evaluation and experimental results on advanced models show many interesting discoveries and inspirations about the different performances of models on various atomic capabilities and the interactions between atomic capabilities. Our findings highlight the importance of decoupling mathematical intelligence into atomic components, providing new insights into model cognition and guiding the development of training strategies toward a more efficient, transferable, and cognitively grounded paradigm of "atomic thinking".
From Faithfulness to Correctness: Generative Reward Models that Think Critically
Ma, Qiyao, Shi, Yunsheng, Tian, Hongtao, Wang, Chao, Chang, Weiming, Yao, Ting
Through reinforcement learning with verifiable rewards (RLVR), large language models have achieved substantial progress in domains with easily verifiable outcomes, such as mathematics and coding. However, when applied to more complex tasks like open-domain question answering, RLVR faces significant challenges due to the difficulty of verifying correctness. The nuanced and ambiguous nature of real-world knowledge makes it difficult to reliably evaluate correctness in these settings, necessitating further abilities that extend beyond mere logical consistency to encompass an understanding and assessment of both external and internal knowledge. Recent work has primarily focused on improving faithfulness, defined as semantic alignment with supporting documents, which can cause models to rely excessively on external sources and diminish their capacity for critical assessment. To address this, we propose the Thinking-supervised Reward Model (TRM), which incorporates sentence-level thinking supervision to endow reward models with critical thinking abilities. Given a query, answer, and supporting documents, TRM first assesses the faithfulness of each answer sentence to the supporting documents, and then applies a reasoning step to evaluate sentence-level correctness. By structuring reward modeling as a sequence of faithfulness, reasoning, and correctness evaluations, TRM encourages models to critically assess and leverage both external and internal knowledge. Experiments on reward signals demonstrate that TRM substantially improves the identification of incorrect sentences, and incorporating TRM into policy optimization leads to significant gains in both answer correctness and usefulness.
Synthesizing Robust Plans under Incomplete Domain Models
Most current planners assume complete domain models and focus on generating correct plans. Unfortunately, domain modeling is a laborious and error-prone task, thus real world agents have to plan with incomplete domain models. While domain experts cannot guarantee completeness, often they are able to circumscribe the incompleteness of the model by providing annotations as to which parts of the domain model may be incomplete. In such cases, the goal should be to synthesize plans that are robust with respect to any known incompleteness of the domain. In this paper, we first introduce annotations expressing the knowledge of the domain incompleteness and formalize the notion of plan robustness with respect to an incomplete domain model.
Learning to Search in Branch and Bound Algorithms
Branch-and-bound is a widely used method in combinatorial optimization, including mixed integer programming, structured prediction and MAP inference. While most work has been focused on developing problem-specific techniques, little is known about how to systematically design the node searching strategy on a branch-and-bound tree. We address the key challenge of learning an adaptive node searching order for any class of problem solvable by branch-and-bound. Our strategies are learned by imitation learning. We apply our algorithm to linear programming based branch-and-bound for solving mixed integer programs (MIP). We compare our method with one of the fastest open-source solvers, SCIP; and a very efficient commercial solver, Gurobi. We demonstrate that our approach achieves better solutions faster on four MIP libraries.
Training Agents Inside of Scalable World Models
Hafner, Danijar, Yan, Wilson, Lillicrap, Timothy
World models learn general knowledge from videos and simulate experience for training behaviors in imagination, offering a path towards intelligent agents. However, previous world models have been unable to accurately predict object interactions in complex environments. We introduce Dreamer 4, a scalable agent that learns to solve control tasks by reinforcement learning inside of a fast and accurate world model. In the complex video game Minecraft, the world model accurately predicts object interactions and game mechanics, outperforming previous world models by a large margin. The world model achieves real-time interactive inference on a single GPU through a shortcut forcing objective and an efficient transformer architecture. Moreover, the world model learns general action conditioning from only a small amount of data, allowing it to extract the majority of its knowledge from diverse unlabeled videos. We propose the challenge of obtaining diamonds in Minecraft from only offline data, aligning with practical applications such as robotics where learning from environment interaction can be unsafe and slow. This task requires choosing sequences of over 20,000 mouse and keyboard actions from raw pixels. By learning behaviors in imagination, Dreamer 4 is the first agent to obtain diamonds in Minecraft purely from offline data, without environment interaction. Our work provides a scalable recipe for imagination training, marking a step towards intelligent agents.
Dense associative memory on the Bures-Wasserstein space
Tankala, Chandan, Balasubramanian, Krishnakumar
Dense associative memories (DAMs) store and retrieve patterns via energy-functional fixed points, but existing models are limited to vector representations. We extend DAMs to probability distributions equipped with the 2-Wasserstein distance, focusing mainly on the Bures-Wasserstein class of Gaussian densities. Our framework defines a log-sum-exp energy over stored distributions and a retrieval dynamics aggregating optimal transport maps in a Gibbs-weighted manner. Stationary points correspond to self-consistent Wasserstein barycenters, generalizing classical DAM fixed points. We prove exponential storage capacity, provide quantitative retrieval guarantees under Wasserstein perturbations, and validate the model on synthetic and real-world distributional tasks. This work elevates associative memory from vectors to full distributions, bridging classical DAMs with modern generative modeling and enabling distributional storage and retrieval in memory-augmented learning.
Curriculum-Guided Reinforcement Learning for Synthesizing Gas-Efficient Financial Derivatives Contracts
Mridul, Maruf Ahmed, Seneviratne, Oshani
Smart contract-based automation of financial derivatives offers substantial efficiency gains, but its real-world adoption is constrained by the complexity of translating financial specifications into gas-efficient executable code. In particular, generating code that is both functionally correct and economically viable from high-level specifications, such as the Common Domain Model (CDM), remains a significant challenge. This paper introduces a Reinforcement Learning (RL) framework to generate functional and gas-optimized Solidity smart contracts directly from CDM specifications. We employ a Proximal Policy Optimization (PPO) agent that learns to select optimal code snippets from a pre-defined library. To manage the complex search space, a two-phase curriculum first trains the agent for functional correctness before shifting its focus to gas optimization. Our empirical results show the RL agent learns to generate contracts with significant gas savings, achieving cost reductions of up to 35.59% on unseen test data compared to unoptimized baselines. This work presents a viable methodology for the automated synthesis of reliable and economically sustainable smart contracts, bridging the gap between high-level financial agreements and efficient on-chain execution.
Can you SPLICE it together? A Human Curated Benchmark for Probing Visual Reasoning in VLMs
Ballout, Mohamad, Wilfred, Okajevo, Yaghoubi, Seyedalireza, Abdelmoneim, Nohayr Muhammad, Mayer, Julius, Bruni, Elia
In this work, we introduce SPLICE, a human-curated benchmark derived from the COIN instructional video dataset, designed to probe event-based reasoning across multiple dimensions: temporal, causal, spatial, contextual, and general knowledge. SPLICE includes 3,381 human-filtered videos spanning 12 categories and 180 sub-categories, such as sports, engineering, and housework. These videos are segmented into a total of 11,423 event clips. We evaluate both human participants and state-of-the-art vision-language models (VLMs) on the task of rearranging these clips into coherent event sequences to assess visual reasoning capabilities. Results reveal a significant gap: VLMs struggle to match human performance. While human-annotated textual descriptions improve model accuracy, they do not affect human performance, suggesting that models rely more on language priors than on visual understanding. Even with annotations, VLMs fall short of human-level reasoning, underscoring persistent challenges in visual reasoning. A deeper analysis across sub-categories shows that VLMs perform relatively better on videos where temporal and causal reasoning are dominant, compared to those where contextual and spatial reasoning are dominant. They also perform better on everyday tasks than on specialized ones.
ContextPRM: Leveraging Contextual Coherence for multi-domain Test-Time Scaling
Zhang, Haotian, Liu, Liu, Yu, Baosheng, Qiu, Jiayan, Xiao, Likang, Ren, Yanwei, Chen, Quan, Liu, Xianglong
Process reward models (PRMs) have demonstrated significant efficacy in enhancing the mathematical reasoning capabilities of large language models (LLMs) by leveraging test-time scaling (TTS). However, while most PRMs exhibit substantial gains in mathematical domains, the scarcity of domain-specific training data and knowledge-based learning patterns limits their generalization ability when faced with other domains. T o address this limitation, we shift the learning objective from verifying domain-specific knowledge to modeling domain-agnostic logical flow. Centering on contextual coherence between chain-of-thought (CoT) steps, our approach is realized through a novel data annotation and training framework, which enhances the model's generalization capabilities across diverse domains. F or instance, our resulting model, ContextPRM, achieves a notable 6.5% average accuracy improvement over the majority voting baseline via weighted majority voting across nine non-mathematical domains in MMLU-Pro, including law, history, and philosophy, significantly surpassing the 2.2% improvement from V ersaPRM and 0.5% gains from other mathematics-focused PRMs, demonstrating consistent performance across both mathematical and non-mathematical domains.