Problem Solving
Marco-o1: Towards Open Reasoning Models for Open-Ended Solutions
Zhao, Yu, Yin, Huifeng, Zeng, Bo, Wang, Hao, Shi, Tianqi, Lyu, Chenyang, Wang, Longyue, Luo, Weihua, Zhang, Kaifu
OpenAI recently introduces the groundbreaking o1 model [OpenAI, 2024, Zhong et al., 2024], renowned for its exceptional reasoning capabilities. This model has demonstrates outstanding performance on platforms such as AIME and CodeForces, surpassing other leading models. Inspired by this success, we aim to push the boundaries of LLMs even further, enhancing their reasoning abilities to tackle complex, real-world challenges. Inspired by OpenAI's o1, we aim to explore potential approaches to shed light on the currently unclear technical roadmap for large reasoning models (LRM). Marco-o1 leverages advanced techniques like CoT fine-tuning [Wei et al., 2022], MCTS [Wei et al., 2022, Feng et al., 2023, Silver et al., 2017], and Reasoning Action Strategies to enhance its reasoning power. As shown in Figure 2, by finetuning Qwen2-7B-Instruct [Yang et al., 2024] with a combination of the filtered Open-O1 CoT dataset [OpenO1 Team, 2024], Marco-o1 CoT dataset, and Marco-o1 Instruction dataset, Marco-o1 improves its handling of complex tasks.
On Expressive Power of Looped Transformers: Theoretical Analysis and Enhancement via Timestep Encoding
However, their expressive power for function approximation and approximation rate remains underexplored. In this paper, we establish approximation rates of Looped Transformers by defining the concept of the modulus of continuity for sequence-to-sequence functions. This reveals a limitation specific to the looped architecture. That is, the analysis prompts us to incorporate scaling parameters for each loop, conditioned on timestep encoding. Experimental results demonstrate that increasing the number of loops enhances performance, with further gains achieved through the timestep encoding architecture.
Evolution of Thought: Diverse and High-Quality Reasoning via Multi-Objective Optimization
Qi, Biqing, Qian, Zhouyi, Luo, Yiang, Gao, Junqi, Li, Dong, Zhang, Kaiyan, Zhou, Bowen
As multi-modal large language models (MLLMs) are increasingly applied to complex reasoning tasks, the diversity and quality of reasoning paths become crucial factors affecting their performance. Although current methods aim to enhance reasoning quality through path expansion, they often neglect the diversity of reasoning paths and effective information sharing, leading to local optima and inefficiency. To address these challenges, we propose Evolution of Thought (EoT), a multi-objective framework designed to improve reasoning by fostering both high-quality and diverse reasoning paths. Specifically, we introduce the Non-dominated Sorting Genetic Algorithm II for multi-objective optimization, utilizing crossover and mutation operators to promote greater diversity in reasoning solutions. Additionally, we propose a Condensation-Aggregation mechanism to cluster and eliminate redundant paths, facilitate improved information sharing among parent nodes, and ultimately enhance both the efficiency and quality of the reasoning process. Validation experiments on various vision-language and language reasoning tasks demonstrate that EoT achieves superior reasoning performance and efficiency compared to other competitive baselines. Our study provides a novel perspective on the design of heuristic reasoning frameworks for MLLMs.
Transformers Provably Solve Parity Efficiently with Chain of Thought
This work provides the first theoretical analysis of training transformers to solve complex problems by recursively generating intermediate states, analogous to fine-tuning for chain-of-thought (CoT) reasoning. We consider training a one-layer transformer to solve the fundamental k-parity problem, extending the work on RNNs by Wies et al. (2023). We establish three key results: (1) any finite-precision gradient-based algorithm, without intermediate supervision, requires substantial iterations to solve parity with finite samples. Our findings, supported by numerical experiments, show that task decomposition and stepwise reasoning naturally arise from optimizing transformers with CoT; moreover, self-consistency checking can improve multistep reasoning ability, aligning with empirical studies of CoT. Large language models (LLMs) based on the transformer architecture (Vaswani et al., 2017) have achieved astounding success across a variety of natural language processing and machine learning tasks (see e.g. These failures are particularly evident in tasks requiring multi-hop reasoning or compounded logical steps (Sakarvadia et al., 2024). A promising approach to overcome these limitations is chain-of-thought (CoT) reasoning, where the model is prompted or fine-tuned to solve complex tasks step-by-step by explicitly making intermediate reasoning steps to arrive at the desired answers (Wei et al., 2022; Kojima et al., 2022). Since its discovery, CoT reasoning has been shown to significantly enhance the problem-solving capabilities of LLMs while also increasing the interpretability and trustworthiness of the reasoning process, and has spawned numerous prompting techniques (Liu et al., 2023; Qiao et al., 2023) and applications for a variety of downstream tasks including common-sense reasoning, mathematical problem-solving, and symbolic or multi-modal reasoning; see e.g.
A Theoretical Survey on Foundation Models
Fu, Shi, Chen, Yuzhu, Wang, Yingjie, Tao, Dacheng
Understanding the inner mechanisms of black-box foundation models (FMs) is essential yet challenging in artificial intelligence and its applications. Over the last decade, the long-running focus has been on their explainability, leading to the development of post-hoc explainable methods to rationalize the specific decisions already made by black-box FMs. However, these explainable methods have certain limitations in terms of faithfulness and resource requirement. Consequently, a new class of interpretable methods should be considered to unveil the underlying mechanisms of FMs in an accurate, comprehensive, heuristic, and resource-light way. This survey aims to review those interpretable methods that comply with the aforementioned principles and have been successfully applied to FMs. These methods are deeply rooted in machine learning theory, covering the analysis of generalization performance, expressive capability, and dynamic behavior. They provide a thorough interpretation of the entire workflow of FMs, ranging from the inference capability and training dynamics to their ethical implications. Ultimately, drawing upon these interpretations, this review identifies the next frontier research directions for FMs.
AVID: Adapting Video Diffusion Models to World Models
Rigter, Marc, Gupta, Tarun, Hilmkil, Agrin, Ma, Chao
Large-scale generative models have achieved remarkable success in a number of domains. However, for sequential decision-making problems, such as robotics, action-labelled data is often scarce and therefore scaling-up foundation models for decision-making remains a challenge. A potential solution lies in leveraging widely-available unlabelled videos to train world models that simulate the consequences of actions. If the world model is accurate, it can be used to optimize decision-making in downstream tasks. Image-to-video diffusion models are already capable of generating highly realistic synthetic videos. However, these models are not action-conditioned, and the most powerful models are closedsource which means they cannot be finetuned. In this work, we propose to adapt pretrained video diffusion models to action-conditioned world models, without access to the parameters of the pretrained model. Our approach, AVID, trains an adapter on a small domain-specific dataset of action-labelled videos. AVID uses a learned mask to modify the intermediate outputs of the pretrained model and generate accurate action-conditioned videos. We evaluate AVID on video game and real-world robotics data, and show that it outperforms existing baselines for diffusion model adaptation. Our results demonstrate that if utilized correctly, pretrained video models have the potential to be powerful tools for embodied AI. Large generative models trained on web-scale data have driven rapid improvement in natural language processing (Brown, 2020; Touvron et al., 2023; Achiam et al., 2023), image generation (Rombach et al., 2022), and video generation (OpenAI, 2024).
Interpretable Video based Stress Detection with Self-Refine Chain-of-thought Reasoning
Stress detection is a critical area of research with significant implications for health monitoring and intervention systems. In this paper, we propose a novel interpretable approach for video-based stress detection, leveraging self-refine chain-of-thought reasoning to enhance both accuracy and transparency in decision-making processes. Our method focuses on extracting subtle behavioral and physiological cues from video sequences that indicate stress levels. By incorporating a chain-of-thought reasoning mechanism, the system refines its predictions iteratively, ensuring that the decision-making process can be traced and explained. The model also learns to self-refine through feedback loops, improving its reasoning capabilities over time. We evaluate our approach on several public and private datasets, demonstrating its superior performance in comparison to traditional video-based stress detection methods. Additionally, we provide comprehensive insights into the interpretability of the model's predictions, making the system highly valuable for applications in both healthcare and human-computer interaction domains.
PIANIST: Learning Partially Observable World Models with LLMs for Multi-Agent Decision Making
Light, Jonathan, Xing, Sixue, Liu, Yuanzhe, Chen, Weiqin, Cai, Min, Chen, Xiusi, Wang, Guanzhi, Cheng, Wei, Yue, Yisong, Hu, Ziniu
Effective extraction of the world knowledge in LLMs for complex decision-making tasks remains a challenge. We propose a framework PIANIST for decomposing the world model into seven intuitive components conducive to zero-shot LLM generation. Given only the natural language description of the game and how input observations are formatted, our method can generate a working world model for fast and efficient MCTS simulation. We show that our method works well on two different games that challenge the planning and decision making skills of the agent for both language and non-language based action taking, without any training on domain-specific training data or explicitly defined world model.
PSPO*: An Effective Process-supervised Policy Optimization for Reasoning Alignment
Li, Jiawei, Liang, Xinyue, Yang, Yizhe, Feng, Chong, Gao, Yang
Process supervision enhances the performance of large language models in reasoning tasks by providing feedback at each step of chain-of-thought reasoning. However, due to the lack of effective process supervision methods, even advanced large language models are prone to logical errors and redundant reasoning. We claim that the effectiveness of process supervision significantly depends on both the accuracy and the length of reasoning chains. Moreover, we identify that these factors exhibit a nonlinear relationship with the overall reward score of the reasoning process. Inspired by these insights, we propose a novel process supervision paradigm, PSPO*, which systematically outlines the workflow from reward model training to policy optimization, and highlights the importance of nonlinear rewards in process supervision. Based on PSPO*, we develop the PSPO-WRS, which considers the number of reasoning steps in determining reward scores and utilizes an adjusted Weibull distribution for nonlinear reward shaping. Experimental results on six mathematical reasoning datasets demonstrate that PSPO-WRS consistently outperforms current mainstream models.
VideoEspresso: A Large-Scale Chain-of-Thought Dataset for Fine-Grained Video Reasoning via Core Frame Selection
Han, Songhao, Huang, Wei, Shi, Hairong, Zhuo, Le, Su, Xiu, Zhang, Shifeng, Zhou, Xu, Qi, Xiaojuan, Liao, Yue, Liu, Si
The advancement of Large Vision Language Models (LVLMs) has significantly improved multimodal understanding, yet challenges remain in video reasoning tasks due to the scarcity of high-quality, large-scale datasets. Existing video question-answering (VideoQA) datasets often rely on costly manual annotations with insufficient granularity or automatic construction methods with redundant frame-by-frame analysis, limiting their scalability and effectiveness for complex reasoning. To address these challenges, we introduce VideoEspresso, a novel dataset that features VideoQA pairs preserving essential spatial details and temporal coherence, along with multimodal annotations of intermediate reasoning steps. Our construction pipeline employs a semantic-aware method to reduce redundancy, followed by generating QA pairs using GPT-4o. We further develop video Chain-of-Thought (CoT) annotations to enrich reasoning processes, guiding GPT-4o in extracting logical relationships from QA pairs and video content. To exploit the potential of high-quality VideoQA pairs, we propose a Hybrid LVLMs Collaboration framework, featuring a Frame Selector and a two-stage instruction fine-tuned reasoning LVLM. This framework adaptively selects core frames and performs CoT reasoning using multimodal evidence. Evaluated on our proposed benchmark with 14 tasks against 9 popular LVLMs, our method outperforms existing baselines on most tasks, demonstrating superior video reasoning capabilities. Our code and dataset will be released at: https://github.com/hshjerry/VideoEspresso