Problem Solving
GigaWorld-0: World Models as Data Engine to Empower Embodied AI
GigaWorld Team, Ye, Angen, Wang, Boyuan, Ni, Chaojun, Huang, Guan, Zhao, Guosheng, Li, Haoyun, Zhu, Jiagang, Li, Kerui, Xu, Mengyuan, Deng, Qiuping, Wang, Siting, Qin, Wenkang, Chen, Xinze, Wang, Xiaofeng, Wang, Yankai, Cao, Yu, Chang, Yifan, Xu, Yuan, Ye, Yun, Wang, Yang, Zhou, Yukun, Zhang, Zhengyuan, Dong, Zhehao, Zhu, Zheng
World models are emerging as a foundational paradigm for scalable, data-efficient embodied AI. In this work, we present GigaWorld-0, a unified world model framework designed explicitly as a data engine for Vision-Language-Action (VLA) learning. GigaWorld-0 integrates two synergistic components: GigaWorld-0-Video, which leverages large-scale video generation to produce diverse, texture-rich, and temporally coherent embodied sequences under fine-grained control of appearance, camera viewpoint, and action semantics; and GigaWorld-0-3D, which combines 3D generative modeling, 3D Gaussian Splatting reconstruction, physically differentiable system identification, and executable motion planning to ensure geometric consistency and physical realism. Their joint optimization enables the scalable synthesis of embodied interaction data that is visually compelling, spatially coherent, physically plausible, and instruction-aligned. Training at scale is made feasible through our efficient GigaTrain framework, which exploits FP8 precision and sparse attention to drastically reduce memory and compute requirements. We conduct comprehensive evaluations showing that GigaWorld-0 generates high-quality, diverse, and controllable data across multiple dimensions. Critically, VLA models (e.g., GigaBrain-0) trained on GigaWorld-0-generated data achieve strong real-world performance, significantly improving generalization and task success on physical robots without any real-world interaction during training.
Look as You Think: Unifying Reasoning and Visual Evidence Attribution for Verifiable Document RAG via Reinforcement Learning
Liu, Shuochen, Luo, Pengfei, Zhang, Chao, Chen, Yuhao, Zhang, Haotian, Liu, Qi, Kou, Xin, Xu, Tong, Chen, Enhong
Aiming to identify precise evidence sources from visual documents, visual evidence attribution for visual document retrieval-augmented generation (VD-RAG) ensures reliable and verifiable predictions from vision-language models (VLMs) in multimodal question answering. Most existing methods adopt end-to-end training to facilitate intuitive answer verification. However, they lack fine-grained supervision and progressive traceability throughout the reasoning process. In this paper, we introduce the Chain-of-Evidence (CoE) paradigm for VD-RAG. CoE unifies Chain-of-Thought (CoT) reasoning and visual evidence attribution by grounding reference elements in reasoning steps to specific regions with bounding boxes and page indexes. To enable VLMs to generate such evidence-grounded reasoning, we propose Look As You Think (LAT), a reinforcement learning framework that trains models to produce verifiable reasoning paths with consistent attribution. During training, LAT evaluates the attribution consistency of each evidence region and provides rewards only when the CoE trajectory yields correct answers, encouraging process-level self-verification. Experiments on vanilla Qwen2.5-VL-7B-Instruct with Paper- and Wiki-VISA benchmarks show that LAT consistently improves the vanilla model in both single- and multi-image settings, yielding average gains of 8.23% in soft exact match (EM) and 47.0% in IoU@0.5. Meanwhile, LAT not only outperforms the supervised fine-tuning baseline, which is trained to directly produce answers with attribution, but also exhibits stronger generalization across domains.
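For readers unfamiliar with the IoU@0.5 metric cited above, a minimal sketch follows. It assumes boxes are given as (x1, y1, x2, y2) corner coordinates and treats a predicted evidence region as a hit when its intersection-over-union with the reference region is at least 0.5. The function names and the one-to-one matching of boxes are illustrative assumptions, not the paper's evaluation code.

```python
def iou(box_a, box_b):
    """Intersection-over-union of two axis-aligned boxes (x1, y1, x2, y2)."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    # Width and height of the overlap; zero when the boxes are disjoint.
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0


def hit_at_threshold(pred_box, ref_box, threshold=0.5):
    """Hypothetical helper: True when the predicted box counts as a hit."""
    return iou(pred_box, ref_box) >= threshold
```

In a multi-page setting, one would presumably also require the predicted page index to match before comparing boxes; that check is omitted here.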
Monitor-Generate-Verify (MGV): Formalising Metacognitive Theory for Language Model Reasoning
Test-time reasoning architectures such as those following the Generate-Verify paradigm, where a model iteratively refines or verifies its own generated outputs, prioritise generation and verification but exclude the monitoring processes that determine when and how reasoning should begin. This omission may contribute to the prefix dominance trap, in which models commit early to suboptimal reasoning paths and seldom recover, yielding roughly 20% accuracy loss. We address this architectural gap by proposing the Monitor-Generate-Verify (MGV) framework, a computational translation of Flavell's and Nelson and Narens' metacognitive theories that preserves their psychological detail. MGV extends the Generate-Verify paradigm by adding explicit monitoring that captures metacognitive experiences (from difficulty assessments to confidence judgements) before generation begins and refines future monitoring through verification feedback. Though we present no empirical validation, MGV provides a vocabulary for diagnosing component-level failures in reasoning systems, suggests specific architectural interventions for future designs, and identifies connections to resource-rational analysis that may ground its mechanisms in normative principles.
Measuring Chain-of-Thought Monitorability Through Faithfulness and Verbosity
Meek, Austin, Sprejer, Eitan, Arcuschin, Iván, Brockmeier, Austin J., Basart, Steven
Chain-of-thought (CoT) outputs let us read a model's step-by-step reasoning. Since any long, serial reasoning process must pass through this textual trace, the quality of the CoT is a direct window into what the model is thinking. This visibility could help us spot unsafe or misaligned behavior (monitorability), but only if the CoT is transparent about its internal reasoning (faithfulness). Fully measuring faithfulness is difficult, so researchers often focus on examining the CoT in cases where the model changes its answer after adding a cue to the input. This proxy finds some instances of unfaithfulness but loses information when the model maintains its answer, and does not investigate aspects of reasoning not tied to the cue. We extend these results to a more holistic sense of monitorability by introducing verbosity: whether the CoT lists every factor needed to solve the task. We combine faithfulness and verbosity into a single monitorability score that shows how well the CoT serves as the model's external "working memory", a property that many safety schemes based on CoT monitoring depend on. We evaluate instruction-tuned and reasoning models on BBH, GPQA, and MMLU. Our results show that models can appear faithful yet remain hard to monitor when they leave out key factors, and that monitorability differs sharply across model families. We release our evaluation code using the Inspect library to support reproducible future work.
Limits of Generalization in RLVR: Two Case Studies in Mathematical Reasoning
Alam, Md Tanvirul, Rastogi, Nidhi
Mathematical reasoning is a central challenge for large language models (LLMs), requiring not only correct answers but also faithful reasoning processes. Reinforcement Learning with Verifiable Rewards (RLVR) has emerged as a promising approach for enhancing such capabilities; however, its ability to foster genuine reasoning remains unclear. We investigate RLVR on two combinatorial problems with fully verifiable solutions: Activity Scheduling and the Longest Increasing Subsequence, using carefully curated datasets with unique optima. Across multiple reward designs, we find that RLVR improves evaluation metrics but often by reinforcing superficial heuristics rather than acquiring new reasoning strategies. These findings highlight the limits of RLVR generalization, emphasizing the importance of benchmarks that disentangle genuine mathematical reasoning from shortcut exploitation and provide faithful measures of progress. Code available at https://github.com/xashru/rlvr-seq-generalization.
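To illustrate why a task like the Longest Increasing Subsequence admits a fully verifiable reward, here is a minimal sketch. It computes the optimal length with the standard patience-sorting method and checks a candidate answer against the input. The binary 0/1 reward and the name `lis_reward` are assumptions for illustration only and do not reproduce the reward designs studied in the paper.

```python
from bisect import bisect_left


def lis_length(seq):
    """Length of the longest strictly increasing subsequence, O(n log n)."""
    tails = []  # tails[k] = smallest tail of an increasing subsequence of length k + 1
    for x in seq:
        i = bisect_left(tails, x)
        if i == len(tails):
            tails.append(x)
        else:
            tails[i] = x
    return len(tails)


def is_increasing_subsequence(candidate, seq):
    """True if candidate is strictly increasing and appears in seq in order."""
    if any(a >= b for a, b in zip(candidate, candidate[1:])):
        return False
    remaining = iter(seq)
    # `x in remaining` advances the iterator, so order is enforced.
    return all(x in remaining for x in candidate)


def lis_reward(candidate, seq):
    """Hypothetical binary reward: 1.0 only for a valid, optimal-length answer."""
    valid = is_increasing_subsequence(candidate, seq)
    return 1.0 if valid and len(candidate) == lis_length(seq) else 0.0
```

For example, with `seq = [10, 9, 2, 5, 3, 7, 101, 18]` the optimal length is 4, so `[2, 3, 7, 18]` earns the reward while `[2, 3, 7]` does not.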
Beyond Token Length: Step Pruner for Efficient and Accurate Reasoning in Large Language Models
Wu, Canhui, Cao, Qiong, Li, Chang, Wang, Zhenfang, Xue, Chao, Fan, Yuwei, Xi, Wei, He, Xiaodong
Large Reasoning Models (LRMs) demonstrate strong performance on complex tasks but often suffer from excessive verbosity, known as "overthinking." Existing solutions via reinforcement learning (RL) typically penalize generated tokens to promote conciseness. However, these methods encounter two challenges: responses with fewer tokens do not always correspond to fewer reasoning steps, and models may develop hacking behavior in later stages of training by discarding reasoning steps to minimize token usage. In this work, we introduce Step Pruner (SP), an RL framework that steers LRMs toward more efficient reasoning by favoring compact reasoning steps. Our step-aware reward function prioritizes correctness while imposing penalties for redundant steps, and withholds rewards for incorrect responses to prevent the reinforcement of erroneous reasoning. Moreover, we propose a dynamic stopping mechanism: when the model's output no longer shortens, training is halted to prevent hacking behavior caused by the merging of steps. Extensive experiments across four reasoning benchmarks demonstrate that SP achieves state-of-the-art accuracy while significantly reducing response length. For instance, on AIME24, SP reduces token usage by 69.7%.
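The abstract does not give the reward formula, so the following is only a sketch of what a step-aware reward with the stated properties might look like: zero reward for incorrect responses, and a penalty that grows with the number of steps beyond a budget for correct ones. The blank-line step delimiter, the `step_budget`, `penalty`, and `floor` parameters, and the linear penalty are all assumptions, not the paper's definition.

```python
def split_steps(response):
    """Assumed convention: reasoning steps are separated by blank lines."""
    return [s for s in response.split("\n\n") if s.strip()]


def step_aware_reward(response, is_correct, step_budget=4, penalty=0.1, floor=0.1):
    """Hypothetical reward: correctness first, then a per-step penalty.

    Incorrect responses receive no reward at all. Correct responses start
    at 1.0 and lose `penalty` for each step beyond `step_budget`, but never
    drop below `floor`, so a correct answer always outranks an incorrect one.
    """
    if not is_correct:
        return 0.0
    extra_steps = max(0, len(split_steps(response)) - step_budget)
    return max(floor, 1.0 - penalty * extra_steps)
```

Counting steps instead of tokens is the point of the design: a response cannot raise its reward merely by using shorter words, only by taking fewer steps.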
Gemini Robotics 1.5: Pushing the Frontier of Generalist Robots with Advanced Embodied Reasoning, Thinking, and Motion Transfer
Gemini Robotics Team, Abdolmaleki, Abbas, Abeyruwan, Saminda, Ainslie, Joshua, Alayrac, Jean-Baptiste, Arenas, Montserrat Gonzalez, Balakrishna, Ashwin, Batchelor, Nathan, Bewley, Alex, Bingham, Jeff, Bloesch, Michael, Bousmalis, Konstantinos, Brakel, Philemon, Brohan, Anthony, Buschmann, Thomas, Byravan, Arunkumar, Cabi, Serkan, Caluwaerts, Ken, Casarini, Federico, Chan, Christine, Chang, Oscar, Chappellet-Volpini, London, Chen, Jose Enrique, Chen, Xi, Chiang, Hao-Tien Lewis, Choromanski, Krzysztof, Collister, Adrian, D'Ambrosio, David B., Dasari, Sudeep, Davchev, Todor, Dave, Meet Kirankumar, Devin, Coline, Di Palo, Norman, Ding, Tianli, Doersch, Carl, Dostmohamed, Adil, Du, Yilun, Dwibedi, Debidatta, Egambaram, Sathish Thoppay, Elabd, Michael, Erez, Tom, Fang, Xiaolin, Fantacci, Claudio, Fong, Cody, Frey, Erik, Fu, Chuyuan, Gao, Ruiqi, Giustina, Marissa, Gopalakrishnan, Keerthana, Graesser, Laura, Groth, Oliver, Gupta, Agrim, Hafner, Roland, Hansen, Steven, Hasenclever, Leonard, Haves, Sam, Heess, Nicolas, Hernaez, Brandon, Hofer, Alex, Hsu, Jasmine, Huang, Lu, Huang, Sandy H., Iscen, Atil, Jacob, Mithun George, Jain, Deepali, Jesmonth, Sally, Jindal, Abhishek, Julian, Ryan, Kalashnikov, Dmitry, Karagozler, M. Emre, Karp, Stefani, Kecman, Matija, Kew, J. Chase, Kim, Donnie, Kim, Frank, Kim, Junkyung, Kipf, Thomas, Kirmani, Sean, Konyushkova, Ksenia, Ku, Li Yang, Kuang, Yuheng, Lampe, Thomas, Laurens, Antoine, Le, Tuan Anh, Leal, Isabel, Lee, Alex X., Lee, Tsang-Wei Edward, Lever, Guy, Liang, Jacky, Lin, Li-Heng, Liu, Fangchen, Long, Shangbang, Lu, Caden, Maddineni, Sharath, Majumdar, Anirudha, Maninis, Kevis-Kokitsi, Marmon, Andrew, Martinez, Sergio, Michaely, Assaf Hurwitz, Milonopoulos, Niko, Moore, Joss, Moreno, Robert, Neunert, Michael, Nori, Francesco, Ortiz, Joy, Oslund, Kenneth, Parada, Carolina, Parisotto, Emilio, Paryag, Amaris, Pooley, Acorn, Power, Thomas, Quaglino, Alessio, Qureshi, Haroon, Raju, Rajkumar Vasudeva, Ran, Helen, Rao, Dushyant, Rao, Kanishka, Reid, Isaac, Rendleman, David, Reymann, Krista, Rivas, Miguel, Romano, Francesco, Rubanova, Yulia, Sampedro, Peter Pastor, Sanketi, Pannag R, Shah, Dhruv, Sharma, Mohit, Shea, Kathryn, Shridhar, Mohit, Shu, Charles, Sindhwani, Vikas, Singh, Sumeet, Soricut, Radu, Sterneck, Rachel, Storz, Ian, Surdulescu, Razvan, Tan, Jie, Tompson, Jonathan, Tunyasuvunakool, Saran, Varley, Jake, Vesom, Grace, Vezzani, Giulia, Villalonga, Maria Bauza, Vinyals, Oriol, Wagner, René, Wahid, Ayzaan, Welker, Stefan, Wohlhart, Paul, Wu, Chengda, Wulfmeier, Markus, Xia, Fei, Xiao, Ted, Xie, Annie, Xie, Jinyu, Xu, Peng, Xu, Sichun, Xu, Ying, Xu, Zhuo, Yan, Jimmy, Yang, Sherry, Yang, Skye, Yang, Yuxiang, Yu, Hiu Hong, Yu, Wenhao, Yuan, Wentao, Yuan, Yuan, Zhang, Jingwei, Zhang, Tingnan, Zhang, Zhiyuan, Zhou, Allan, Zhou, Guangyao, Zhou, Yuxiang
General-purpose robots need a deep understanding of the physical world, advanced reasoning, and general and dexterous control. This report introduces the latest generation of the Gemini Robotics model family: Gemini Robotics 1.5, a multi-embodiment Vision-Language-Action (VLA) model, and Gemini Robotics-ER 1.5, a state-of-the-art Embodied Reasoning (ER) model. We are bringing together three major innovations. First, Gemini Robotics 1.5 features a novel architecture and a Motion Transfer (MT) mechanism, which enables it to learn from heterogeneous, multi-embodiment robot data and makes the VLA more general. Second, Gemini Robotics 1.5 interleaves actions with a multi-level internal reasoning process in natural language. This enables the robot to "think before acting" and notably improves its ability to decompose and execute complex, multi-step tasks, and also makes the robot's behavior more interpretable to the user. Third, Gemini Robotics-ER 1.5 establishes a new state-of-the-art for embodied reasoning, i.e., for reasoning capabilities that are critical for robots, such as visual and spatial understanding, task planning, and progress estimation. Together, this family of models takes us a step towards an era of physical agents, enabling robots to perceive, think and then act so they can solve complex multi-step tasks.
Remote Sensing-Oriented World Model
Lu, Yuxi, Wu, Biao, Li, Zhidong, Li, Kunqi, Huang, Chenya, Wang, Huacan, Lan, Qizhen, Chen, Ronghao, Chen, Ling, Liang, Bin
World models have shown potential in artificial intelligence by predicting and reasoning about world states beyond direct observations. However, existing approaches are predominantly evaluated in synthetic environments or constrained scene settings, limiting their validation in real-world contexts with broad spatial coverage and complex semantics. Meanwhile, remote sensing applications urgently require spatial reasoning capabilities for disaster response and urban planning. This paper bridges these gaps by introducing the first framework for world modeling in remote sensing. We formulate remote sensing world modeling as direction-conditioned spatial extrapolation, where models generate semantically consistent adjacent image tiles given a central observation and directional instruction. To enable rigorous evaluation, we develop RSWISE (Remote Sensing World-Image Spatial Evaluation), a benchmark containing 1,600 evaluation tasks across four scenarios: general, flood, urban, and rural. RSWISE combines visual fidelity assessment with instruction compliance evaluation using GPT-4o as a semantic judge, ensuring models genuinely perform spatial reasoning rather than simple replication. Afterwards, we present RemoteBAGEL, a unified multimodal model fine-tuned on remote sensing data for spatial extrapolation tasks. Extensive experiments demonstrate that RemoteBAGEL consistently outperforms state-of-the-art baselines on RSWISE.
AutoDrive-R$^2$: Incentivizing Reasoning and Self-Reflection Capacity for VLA Model in Autonomous Driving
Yuan, Zhenlong, Qian, Chengxuan, Tang, Jing, Chen, Rui, Song, Zijian, Sun, Lei, Chu, Xiangxiang, Cai, Yujun, Zhang, Dapeng, Li, Shuo
Vision-Language-Action (VLA) models in autonomous driving systems have recently demonstrated transformative potential by integrating multimodal perception with decision-making capabilities. However, the interpretability and coherence of the decision process and the plausibility of action sequences remain largely underexplored. To address these issues, we propose AutoDrive-R$^2$, a novel VLA framework that enhances both reasoning and self-reflection capabilities of autonomous driving systems through chain-of-thought (CoT) processing and reinforcement learning (RL). Specifically, we first propose an innovative CoT dataset named nuScenesR$^2$-6K for supervised fine-tuning, which effectively builds cognitive bridges between input information and output trajectories through a four-step logical chain with self-reflection for validation. Moreover, to maximize both reasoning and self-reflection during the RL stage, we further employ the Group Relative Policy Optimization (GRPO) algorithm within a physics-grounded reward framework that incorporates spatial alignment, vehicle dynamics, and temporal smoothness criteria to ensure reliable and realistic trajectory planning. Extensive evaluation results across both the nuScenes and Waymo datasets demonstrate the state-of-the-art performance and robust generalization capacity of our proposed method.
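As background on the GRPO step mentioned above, the sketch below shows the group-relative advantage computation that gives the algorithm its name: several responses are sampled for the same prompt, and each reward is normalized by the group's mean and standard deviation. The use of the population standard deviation and the `eps` stabilizer are implementation choices assumed here; the scalar rewards stand in for whatever the paper's physics-grounded reward produces.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward against the other samples for the same prompt.

    `rewards` holds one scalar reward per sampled response in the group.
    `eps` avoids division by zero when every response gets the same reward.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]
```

Responses that score above the group average get a positive advantage and are reinforced; when all responses in a group score the same, every advantage is zero and the group contributes no gradient signal.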
A Survey on Improving Human Robot Collaboration through Vision-and-Language Navigation
Yakolli, Nivedan, Gautam, Avinash, Das, Abhijit, Qi, Yuankai, Shekhawat, Virendra Singh
Vision-and-Language Navigation (VLN) is a multi-modal, cooperative task requiring agents to interpret human instructions, navigate 3D environments, and communicate effectively under ambiguity. This paper presents a comprehensive review of recent VLN advancements in robotics and outlines promising directions to improve multi-robot coordination. Despite progress, current models struggle with bidirectional communication, ambiguity resolution, and collaborative decision-making in multi-agent systems. We review approximately 200 relevant articles to provide an in-depth understanding of the current landscape. Through this survey, we aim to provide a thorough resource that inspires further research at the intersection of VLN and robotics. We advocate that future VLN systems should support proactive clarification, real-time feedback, and contextual reasoning through advanced natural language understanding (NLU) techniques. Additionally, decentralized decision-making frameworks with dynamic role assignment are essential for scalable, efficient multi-robot collaboration. These innovations can significantly enhance human-robot interaction (HRI) and enable real-world deployment in domains such as healthcare, logistics, and disaster response.