Problem Solving
A Survey on Vision-Language-Action Models: An Action Tokenization Perspective
Zhong, Yifan, Bai, Fengshuo, Cai, Shaofei, Huang, Xuchuan, Chen, Zhang, Zhang, Xiaowei, Wang, Yuanfei, Guo, Shaoyang, Guan, Tianrui, Lui, Ka Nam, Qi, Zhiquan, Liang, Yitao, Chen, Yuanpei, Yang, Yaodong
The remarkable advancements of vision and language foundation models in multimodal understanding, reasoning, and generation has sparked growing efforts to extend such intelligence to the physical world, fueling the flourishing of vision-language-action (VLA) models. Despite seemingly diverse approaches, we observe that current VLA models can be unified under a single framework: vision and language inputs are processed by a series of VLA modules, producing a chain of \textit{action tokens} that progressively encode more grounded and actionable information, ultimately generating executable actions. We further determine that the primary design choice distinguishing VLA models lies in how action tokens are formulated, which can be categorized into language description, code, affordance, trajectory, goal state, latent representation, raw action, and reasoning. However, there remains a lack of comprehensive understanding regarding action tokens, significantly impeding effective VLA development and obscuring future directions. Therefore, this survey aims to categorize and interpret existing VLA research through the lens of action tokenization, distill the strengths and limitations of each token type, and identify areas for improvement. Through this systematic review and analysis, we offer a synthesized outlook on the broader evolution of VLA models, highlight underexplored yet promising directions, and contribute guidance for future research, hoping to bring the field closer to general-purpose intelligence.
Rethinking the Illusion of Thinking
Varela, Iรฑaki Dellibarda, Romero-Sorozabal, Pablo, Rocon, Eduardo, Cebrian, Manuel
Earlier this year, Apple ignited controversy by publishing "The Illusion of Thinking," prompting heated debate within the AI community. Critics seized upon the findings as conclusive evidence that Large Reasoning Models (LRMs) lack genuine reasoning capabilities, branding them as mere stochastic parrots. Meanwhile, defenders-spearheaded by Lawsen et al. (2025)-fired back, condemning the experimental setup as flawed and the conclusions overstated. We clarify this debate by replicating and refining two of the original study's most contentious benchmarks: Towers of Hanoi and River Crossing. By introducing incremental stepwise prompting and agentic collaborative dialogue, we show that previously reported failures solving the Towers of Hanoi were not purely result of output constraints, but also partly a result of cognition limitations: LRMs still stumble when complexity rises moderately (around 8 disks). Moreover, the River Crossing results initially heralded as catastrophic failures turn out to hinge upon testing unsolvable configurations. Once we limit tests strictly to solvable problems-LRMs effortlessly solve large instances involving over 100 agent pairs. Our findings ultimately defy simplistic narratives: today's LRMs are stochastic, RL-tuned searchers in a discrete state space we barely understand. Real progress in symbolic, long-horizon reasoning demands mapping that terrain through fine-grained ablations like those introduced here.
Adapting Rule Representation With Four-Parameter Beta Distribution for Learning Classifier Systems
Shiraishi, Hiroki, Hayamizu, Yohei, Hashiyama, Tomonori, Takadama, Keiki, Ishibuchi, Hisao, Nakata, Masaya
Rule representations significantly influence the search capabilities and decision boundaries within the search space of Learning Classifier Systems (LCSs), a family of rule-based machine learning systems that evolve interpretable models through evolutionary processes. However, it is very difficult to choose an appropriate rule representation for each problem. Additionally, some problems benefit from using different representations for different subspaces within the input space. Thus, an adaptive mechanism is needed to choose an appropriate rule representation for each rule in LCSs. This article introduces a flexible rule representation using a four-parameter beta distribution and integrates it into a fuzzy-style LCS. The four-parameter beta distribution can form various function shapes, and this flexibility enables our LCS to automatically select appropriate representations for different subspaces. Our rule representation can represent crisp/fuzzy decision boundaries in various boundary shapes, such as rectangles and bells, by controlling four parameters, compared to the standard representations such as trapezoidal ones. Leveraging this flexibility, our LCS is designed to adapt the appropriate rule representation for each subspace. Moreover, our LCS incorporates a generalization bias favoring crisp rules where feasible, enhancing model interpretability without compromising accuracy. Experimental results on real-world classification tasks show that our LCS achieves significantly superior test accuracy and produces more compact rule sets. Our implementation is available at https://github.com/YNU-NakataLab/Beta4-UCS. An extended abstract related to this work is available at https://doi.org/10.36227/techrxiv.174900805.59801248/v1.
Thinking Fast and Slow in Human and Machine Intelligence
Human intelligence has generally been studied by focusing on two primary levels: cognitive science, which examines the mind, and neuroscience, which focuses on the brain. Both approaches have influenced artificial intelligence (AI) research, leading to the development of various cognitive architectures with emergent behaviors.23 In this article, we propose an approach inspired by human cognition, specifically drawing on cognitive theories about human reasoning and decision making. We are inspired by the book Thinking, Fast and Slow by Daniel Kahneman,20 which categorizes human thought processes into two systems: System 1 (fast thinking) and System 2 (slow thinking).37 System 1, or "thinking fast," is responsible for intuitive, quick, and often unconscious decisions.
CaughtCheating: Is Your MLLM a Good Cheating Detective? Exploring the Boundary of Visual Perception and Reasoning
Li, Ming, Wang, Chenguang, Liang, Yijun, Wang, Xiyao, Zhou, Yuhang, Wu, Xiyang, Zhang, Yuqing, Zhang, Ruiyi, Zhou, Tianyi
Recent agentic Multi-Modal Large Language Models (MLLMs) such as GPT-o3 have achieved near-ceiling scores on various existing benchmarks, motivating a demand for more challenging test tasks. These MLLMs have been reported to excel in a few expert-level tasks for humans, e.g., GeoGuesser, reflecting their potential as a detective who can notice minuscule cues in an image and weave them into coherent, situational explanations, leading to a reliable answer. But can they match the performance of excellent human detectives? To answer this question, we investigate some hard scenarios where GPT-o3 can still handle, and find a common scenario where o3's performance drops to nearly zero, which we name CaughtCheating. It is inspired by the social media requests that ask others to detect suspicious clues from photos shared by the poster's partner. We conduct extensive experiments and analysis to understand why existing MLLMs lack sufficient capability to solve this kind of task. CaughtCheating provides a class of challenging visual perception and reasoning tasks with great value and practical usage. Success in these tasks paves the way for MLLMs to acquire human-level detective perception and reasoning capabilities.
Enhancing Reasoning Capabilities in SLMs with Reward Guided Dataset Distillation
The push to compress and impart the proficiency of Large Language Models (LLMs) into more deployable and efficient Small Language Models (SLMs) has benefited from improvements in knowledge distillation (KD) techniques. These techniques allow a smaller student model to learn from a more capable and larger teacher model's responses. However, distillation often revolves around the student model merely copying the teacher's in-distribution responses, limiting its generalisability. This limitation is amplified on reasoning tasks and can be computationally expensive. In this study, we propose AdvDistill, a reward-guided dataset distillation framework. We utilise multiple generations (responses) from a teacher for each prompt and assign rewards based on rule-based verifiers. These varying and normally distributed rewards serve as weights when training student models. Our methods and their subsequent behavioural analysis demonstrate a significant improvement in student model performance for mathematical and complex reasoning tasks, showcasing the efficacy and benefits of incorporating a rewarding mechanism in dataset distillation processes.
T2I-R1: Reinforcing Image Generation with Collaborative Semantic-level and Token-level CoT
Jiang, Dongzhi, Guo, Ziyu, Zhang, Renrui, Zong, Zhuofan, Li, Hao, Zhuo, Le, Yan, Shilin, Heng, Pheng-Ann, Li, Hongsheng
Recent advancements in large language models have demonstrated how chain-of-thought (CoT) and reinforcement learning (RL) can improve performance. However, applying such reasoning strategies to the visual generation domain remains largely unexplored. In this paper, we present T2I-R1, a novel reasoning-enhanced text-to-image generation model, powered by RL with a bi-level CoT reasoning process. Specifically, we identify two levels of CoT that can be utilized to enhance different stages of generation: (1) the semantic-level CoT for high-level planning of the prompt and (2) the token-level CoT for low-level pixel processing during patch-by-patch generation. To better coordinate these two levels of CoT, we introduce BiCoT-GRPO with an ensemble of generation rewards, which seamlessly optimizes both generation CoTs within the same training step. By applying our reasoning strategies to the baseline model, Janus-Pro, we achieve superior performance with 13% improvement on T2I-CompBench and 19% improvement on the WISE benchmark, even surpassing the state-of-the-art model FLUX.1. Code is available at: https://github.com/CaraJ7/T2I-R1
Learning Counterfactually Decoupled Attention for Open-World Model Attribution
Zheng, Yu, Gong, Boyang, Kong, Fanye, Duan, Yueqi, Yu, Bingyao, Zheng, Wenzhao, Chen, Lei, Lu, Jiwen, Zhou, Jie
In this paper, we propose a Counterfactually Decoupled Attention Learning (CDAL) method for open-world model attribution. Existing methods rely on handcrafted design of region partitioning or feature space, which could be confounded by the spurious statistical correlations and struggle with novel attacks in open-world scenarios. T o address this, CDAL explicitly models the causal relationships between the attentional visual traces and source model attribution, and counterfactually decouples the discriminative model-specific artifacts from confounding source biases for comparison. In this way, the resulting causal effect provides a quantification on the quality of learned attention maps, thus encouraging the network to capture essential generation patterns that generalize to unseen source models by maximizing the effect. Extensive experiments on existing open-world model attribution benchmarks show that with minimal computational overhead, our method consistently improves state-of-the-art models by large margins, particularly for unseen novel attacks.
ReasonBridge: Efficient Reasoning Transfer from Closed to Open-Source Language Models
Recent advancements in Large Language Models (LLMs) have revealed a significant performance gap between closed-source and open-source models, particularly in tasks requiring complex reasoning and precise instruction following. This paper introduces ReasonBridge, a methodology that efficiently transfers reasoning capabilities from powerful closed-source to open-source models through a novel hierarchical knowledge distillation framework. We develop a tailored dataset Reason1K with only 1,000 carefully curated reasoning traces emphasizing difficulty, diversity, and quality. These traces are filtered from across multiple domains using a structured multi-criteria selection algorithm. Our transfer learning approach incorporates: (1) a hierarchical distillation process capturing both strategic abstraction and tactical implementation patterns, (2) a sparse reasoning-focused adapter architecture requiring only 0.3% additional trainable parameters, and (3) a test-time compute scaling mechanism using guided inference interventions. Comprehensive evaluations demonstrate that ReasonBridge improves reasoning capabilities in open-source models by up to 23% on benchmark tasks, significantly narrowing the gap with closed-source models. Notably, the enhanced Qwen2.5-14B outperforms Claude-Sonnet3.5 on MATH500 and matches its performance on competition-level AIME problems. Our methodology generalizes effectively across diverse reasoning domains and model architectures, establishing a sample-efficient approach to reasoning enhancement for instruction following.
RoboScape: Physics-informed Embodied World Model
Shang, Yu, Zhang, Xin, Tang, Yinzhou, Jin, Lei, Gao, Chen, Wu, Wei, Li, Yong
World models have become indispensable tools for embodied intelligence, serving as powerful simulators capable of generating realistic robotic videos while addressing critical data scarcity challenges. However, current embodied world models exhibit limited physical awareness, particularly in modeling 3D geometry and motion dynamics, resulting in unrealistic video generation for contact-rich robotic scenarios. In this paper, we present RoboScape, a unified physics-informed world model that jointly learns RGB video generation and physics knowledge within an integrated framework. We introduce two key physics-informed joint training tasks: temporal depth prediction that enhances 3D geometric consistency in video rendering, and keypoint dynamics learning that implicitly encodes physical properties (e.g., object shape and material characteristics) while improving complex motion modeling. Extensive experiments demonstrate that RoboScape generates videos with superior visual fidelity and physical plausibility across diverse robotic scenarios. We further validate its practical utility through downstream applications including robotic policy training with generated data and policy evaluation. Our work provides new insights for building efficient physics-informed world models to advance embodied intelligence research. The code is available at: https://github.com/tsinghua-fib-lab/RoboScape.