Problem Solving
DMWM: Dual-Mind World Model with Long-Term Imagination
Wang, Lingyi, Shelim, Rashed, Saad, Walid, Ramakrishnan, Naren
Imagination in world models is crucial for enabling agents to learn long-horizon policy in a sample-efficient manner. Existing recurrent state-space model (RSSM)-based world models depend on single-step statistical inference to capture the environment dynamics, and, hence, they are unable to perform long-term imagination tasks due to the accumulation of prediction errors. Inspired by the dual-process theory of human cognition, we propose a novel dual-mind world model (DMWM) framework that integrates logical reasoning to enable imagination with logical consistency. DMWM is composed of two components: an RSSM-based System 1 (RSSM-S1) component that handles state transitions in an intuitive manner and a logic-integrated neural network-based System 2 (LINN-S2) component that guides the imagination process through hierarchical deep logical reasoning. The inter-system feedback mechanism is designed to ensure that the imagination process follows the logical rules of the real environment. The proposed framework is evaluated on benchmark tasks that require long-term planning from the DMControl suite. Extensive experimental results demonstrate that the proposed framework yields significant improvements in terms of logical coherence, trial efficiency, data efficiency and long-term imagination over the state-of-the-art world models.
When More is Less: Understanding Chain-of-Thought Length in LLMs
Wu, Yuyang, Wang, Yifei, Du, Tianqi, Jegelka, Stefanie, Wang, Yisen
Chain-of-thought (CoT) reasoning enhances the multi-step reasoning capabilities of large language models (LLMs) by breaking complex tasks into smaller, manageable sub-tasks. Researchers have been exploring ways to guide models to generate more complex CoT processes to improve the reasoning ability of LLMs, such as long CoT and the test-time scaling law. However, for most models and tasks, does an increase in CoT length consistently lead to improved reasoning accuracy? In this paper, we observe a nuanced relationship: as the number of reasoning steps increases, performance initially improves but eventually decreases. To understand this phenomenon, we provide a piece of evidence that longer reasoning processes are increasingly susceptible to noise. We theoretically prove the existence of an optimal CoT length and derive a scaling law for this optimal length based on model capability and task difficulty. Inspired by our theory, we conduct experiments on both synthetic and real world datasets and propose Length-filtered Vote to alleviate the effects of excessively long or short CoTs. Our findings highlight the critical need to calibrate CoT length to align with model capabilities and task demands, offering a principled framework for optimizing multi-step reasoning in LLMs.
CREDAL: Close Reading of Data Models
Fletcher, George, Nahurna, Olha, Prytula, Matvii, Stoyanovich, Julia
Data models are necessary for the birth of data and of any data-driven system. Indeed, every algorithm, every machine learning model, every statistical model, and every database has an underlying data model without which the system would not be usable. Hence, data models are excellent sites for interrogating the (material, social, political, ...) conditions giving rise to a data system. Towards this, drawing inspiration from literary criticism, we propose to closely read data models in the same spirit as we closely read literary artifacts. Close readings of data models reconnect us with, among other things, the materiality, the genealogies, the techne, the closed nature, and the design of technical systems. While recognizing from literary theory that there is no one correct way to read, it is nonetheless critical to have systematic guidance for those unfamiliar with close readings. This is especially true for those trained in the computing and data sciences, who too often are enculturated to set aside the socio-political aspects of data work. A systematic methodology for reading data models currently does not exist. To fill this gap, we present the CREDAL methodology for close readings of data models. We detail our iterative development process and present results of a qualitative evaluation of CREDAL demonstrating its usability, usefulness, and effectiveness in the critical study of data.
Interactive Sketchpad: An Interactive Multimodal System for Collaborative, Visual Problem-Solving
Chen, Steven-Shine, Lee, Jimin, Liang, Paul Pu
Humans have long relied on visual aids like sketches and diagrams to support reasoning and problem-solving. Visual tools, like auxiliary lines in geometry or graphs in calculus, are essential for understanding complex ideas. However, many tutoring systems remain text-based, providing feedback only through natural language. Leveraging recent advances in Large Multimodal Models (LMMs), this paper introduces Interactive Sketchpad, a tutoring system that combines language-based explanations with interactive visualizations to enhance learning. Built on a pre-trained LMM, Interactive Sketchpad is fine-tuned to provide step-by-step guidance in both text and visuals, enabling natural multimodal interaction with the student. Accurate and robust diagrams are generated by incorporating code execution into the reasoning process. User studies conducted on math problems such as geometry, calculus, and trigonometry demonstrate that Interactive Sketchpad leads to improved task comprehension, problem-solving accuracy, and engagement levels, highlighting its potential for transforming educational technologies.
Exploring the Limit of Outcome Reward for Learning Mathematical Reasoning
Lyu, Chengqi, Gao, Songyang, Gu, Yuzhe, Zhang, Wenwei, Gao, Jianfei, Liu, Kuikun, Wang, Ziyi, Li, Shuaibin, Zhao, Qian, Huang, Haian, Cao, Weihan, Liu, Jiangning, Liu, Hongwei, Liu, Junnan, Zhang, Songyang, Lin, Dahua, Chen, Kai
Reasoning abilities, especially those for solving complex math problems, are crucial components of general intelligence. Recent advances by proprietary companies, such as o-series models of OpenAI, have made remarkable progress on reasoning tasks. However, the complete technical details remain unrevealed, and the techniques that are believed certainly to be adopted are only reinforcement learning (RL) and the long chain of thoughts. This paper proposes a new RL framework, termed OREAL, to pursue the performance limit that can be achieved through \textbf{O}utcome \textbf{RE}w\textbf{A}rd-based reinforcement \textbf{L}earning for mathematical reasoning tasks, where only binary outcome rewards are easily accessible. We theoretically prove that behavior cloning on positive trajectories from best-of-N (BoN) sampling is sufficient to learn the KL-regularized optimal policy in binary feedback environments. This formulation further implies that the rewards of negative samples should be reshaped to ensure the gradient consistency between positive and negative samples. To alleviate the long-existing difficulties brought by sparse rewards in RL, which are even exacerbated by the partial correctness of the long chain of thought for reasoning tasks, we further apply a token-level reward model to sample important tokens in reasoning trajectories for learning. With OREAL, for the first time, a 7B model can obtain 94.0 pass@1 accuracy on MATH-500 through RL, being on par with 32B models. OREAL-32B also surpasses previous 32B models trained by distillation with 95.0 pass@1 accuracy on MATH-500. Our investigation also indicates the importance of initial policy models and training queries for RL. Code, models, and data will be released to benefit future research\footnote{https://github.com/InternLM/OREAL}.
VersaPRM: Multi-Domain Process Reward Model via Synthetic Reasoning Data
Zeng, Thomas, Zhang, Shuibai, Wu, Shutong, Classen, Christian, Chae, Daewon, Ewer, Ethan, Lee, Minjae, Kim, Heeju, Kang, Wonjun, Kunde, Jackson, Fan, Ying, Kim, Jungtaek, Koo, Hyung Il, Ramchandran, Kannan, Papailiopoulos, Dimitris, Lee, Kangwook
In particular, Outcome Reward Models (ORMs) are Process Reward Models (PRMs) have proven used to provide supervision based solely on the correctness effective at enhancing mathematical reasoning of the final outcome. However, ORMs fail to address errors for Large Language Models (LLMs) by leveraging in intermediate steps, limiting their effectiveness for increased inference-time computation. However, complex, multi-step reasoning tasks (Luo et al., 2024; Lightman they are predominantly trained on mathematical et al., 2024; Sun et al., 2024). Because ORMs suffer data and their generalizability to nonmathematical from this limitation, Process Reward Models (PRMs) have domains has not been rigorously been proposed to offer fine-grained, step-by-step feedback studied. In response, this work first shows that on the correctness of each reasoning step (Lightman et al., current PRMs have poor performance in other 2024; Uesato et al., 2022). PRMs have proven highly effective domains. To address this limitation, we introduce during inference, improving the reranking of generated VersaPRM, a multi-domain PRM trained solutions and guiding LLMs through search-based on synthetic reasoning data generated using our algorithms (Wan et al., 2024; Wang et al., 2024a).
From Foresight to Forethought: VLM-In-the-Loop Policy Steering via Latent Alignment
Wu, Yilin, Tian, Ran, Swamy, Gokul, Bajcsy, Andrea
While generative robot policies have demonstrated significant potential in learning complex, multimodal behaviors from demonstrations, they still exhibit diverse failures at deployment-time. Policy steering offers an elegant solution to reducing the chance of failure by using an external verifier to select from low-level actions proposed by an imperfect generative policy. Here, one might hope to use a Vision Language Model (VLM) as a verifier, leveraging its open-world reasoning capabilities. However, off-the-shelf VLMs struggle to understand the consequences of low-level robot actions as they are represented fundamentally differently than the text and images the VLM was trained on. In response, we propose FOREWARN, a novel framework to unlock the potential of VLMs as open-vocabulary verifiers for runtime policy steering. Our key idea is to decouple the VLM's burden of predicting action outcomes (foresight) from evaluation (forethought). For foresight, we leverage a latent world model to imagine future latent states given diverse low-level action plans. For forethought, we align the VLM with these predicted latent states to reason about the consequences of actions in its native representation--natural language--and effectively filter proposed plans. We validate our framework across diverse robotic manipulation tasks, demonstrating its ability to bridge representational gaps and provide robust, generalizable policy steering. Videos can be found on the project website: https://yilin-wu98.github.io/forewarn/.
The Minimal Search Space for Conditional Causal Bandits
Simoes, Francisco N. F. Q., Feigenbaum, Itai, Dastani, Mehdi, van Ommen, Thijs
Causal knowledge can be used to support decision-making problems. This has been recognized in the causal bandits literature, where a causal (multi-armed) bandit is characterized by a causal graphical model and a target variable. The arms are then interventions on the causal model, and rewards are samples of the target variable. Causal bandits were originally studied with a focus on hard interventions. We focus instead on cases where the arms are conditional interventions, which more accurately model many real-world decision-making problems by allowing the value of the intervened variable to be chosen based on the observed values of other variables. This paper presents a graphical characterization of the minimal set of nodes guaranteed to contain the optimal conditional intervention, which maximizes the expected reward. We then propose an efficient algorithm with a time complexity of $O(|V| + |E|)$ to identify this minimal set of nodes. We prove that the graphical characterization and the proposed algorithm are correct. Finally, we empirically demonstrate that our algorithm significantly prunes the search space and substantially accelerates convergence rates when integrated into standard multi-armed bandit algorithms.
Action-Free Reasoning for Policy Generalization
Clark, Jaden, Mirchandani, Suvir, Sadigh, Dorsa, Belkhale, Suneel
End-to-end imitation learning offers a promising approach for training robot policies. However, generalizing to new settings remains a significant challenge. Although large-scale robot demonstration datasets have shown potential for inducing generalization, they are resource-intensive to scale. In contrast, human video data is abundant and diverse, presenting an attractive alternative. Yet, these human-video datasets lack action labels, complicating their use in imitation learning. Existing methods attempt to extract grounded action representations (e.g., hand poses), but resulting policies struggle to bridge the embodiment gap between human and robot actions. We propose an alternative approach: leveraging language-based reasoning from human videos-essential for guiding robot actions-to train generalizable robot policies. Building on recent advances in reasoning-based policy architectures, we introduce Reasoning through Action-free Data (RAD). RAD learns from both robot demonstration data (with reasoning and action labels) and action-free human video data (with only reasoning labels). The robot data teaches the model to map reasoning to low-level actions, while the action-free data enhances reasoning capabilities. Additionally, we will release a new dataset of 3,377 human-hand demonstrations with reasoning annotations compatible with the Bridge V2 benchmark and aimed at facilitating future research on reasoning-driven robot learning. Our experiments show that RAD enables effective transfer across the embodiment gap, allowing robots to perform tasks seen only in action-free data. Furthermore, scaling up action-free reasoning data significantly improves policy performance and generalization to novel tasks. These results highlight the promise of reasoning-driven learning from action-free datasets for advancing generalizable robot control. Project page: https://rad-generalization.github.io
Cognitive Paradigms for Evaluating VLMs on Visual Reasoning Task
Vaishnav, Mohit, Tammet, Tanel
Advancing machine visual reasoning requires a deeper understanding of how Vision-Language Models (VLMs) process and interpret complex visual patterns. This work introduces a novel, cognitively-inspired evaluation framework to systematically analyze VLM reasoning on natural image-based Bongard Problems. We propose three structured paradigms -- Direct Visual Rule Learning, Deductive Rule Learning, and Componential Analysis -- designed to progressively enforce step-wise reasoning and disentangle the interplay between perception and reasoning. Our evaluation shows that advanced, closed-source VLMs (GPT-4o and Gemini 2.0) achieve near-superhuman performance, particularly when provided with high-quality image descriptions, while open-source models exhibit a significant performance bottleneck due to deficiencies in perception. An ablation study further confirms that perception, rather than reasoning, is the primary limiting factor, as open-source models apply extracted rules effectively when given accurate descriptions. These findings underscore the critical role of robust multimodal perception in enhancing generalizable visual reasoning and highlight the importance of structured, step-wise reasoning paradigms for advancing machine intelligence.