AITopics | Spatial Reasoning

Collaborating Authors

Spatial Reasoning

News Overviews Instructional Materials AI-Alerts Classics

Evo-0: Vision-Language-Action Model with Implicit Spatial Understanding

Lin, Tao, Li, Gen, Zhong, Yilei, Zou, Yanwen, Du, Yuxin, Liu, Jiting, Gu, Encheng, Zhao, Bo

arXiv.org Artificial IntelligenceNov-25-2025

Vision-Language-Action (VLA) models have emerged as a promising framework for enabling generalist robots capable of perceiving, reasoning, and acting in the real world. These models usually build upon pretrained Vision-Language Models (VLMs), which excel at semantic understanding due to large-scale image and text pretraining. However, existing VLMs typically lack precise spatial understanding capabilities, as they are primarily tuned on 2D image-text pairs without 3D supervision. To address this limitation, recent approaches have incorporated explicit 3D inputs such as point clouds or depth maps, but this necessitates additional depth sensors or pre-trained depth estimation models, which may yield defective results. In contrast, our work introduces a plug-and-play module that implicitly incorporates 3D geometry features into VLA models by leveraging an off-the-shelf visual geometry foundation model. This integration provides the model with depth-aware visual representations, improving its ability to understand the geometric structure of the scene and the spatial relationships among objects from RGB images alone. We evaluate our method on a set of spatially challenging tasks in both simulation and the real world. Extensive evaluations show that our method significantly improves the performance of state-of-the-art VLA models across diverse scenarios.

arxiv preprint arxiv, large language model, machine learning, (19 more...)

arXiv.org Artificial Intelligence

2507.00416

Genre: Research Report > New Finding (0.46)

Technology:

Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Robots (1.00)
Information Technology > Artificial Intelligence > Machine Learning (1.00)
(2 more...)

Add feedback

Spatial Knowledge Graph-Guided Multimodal Synthesis

Xue, Yida, Bi, Zhen, Yang, Jinnan, Lou, Jungang, Chen, Kehai, Zhang, Min, Chen, Huajun, Zhang, Ningyu

arXiv.org Artificial IntelligenceNov-25-2025

Recent advances in Multimodal Large Language Models (MLLMs) have significantly enhanced their capabilities; however, their spatial perception abilities remain a notable limitation. To address this challenge, multimodal data synthesis offers a promising solution. Yet, ensuring that synthesized data adhere to spatial common sense is a non-trivial task. Our approach addresses this critical gap by providing a systematic framework for generating spatially coherent data. In this work, we introduce SKG2DATA, a novel multimodal synthesis approach guided by spatial knowledge graphs, grounded in the concept of knowledge-to-data generation. SKG2DATA employs an automated pipeline for constructing Spatial Knowledge Graph (SKG) that effectively captures human-like spatial cognition, including directional and distance relationships. These structured representations then serve as precise guidance for our integrated synthesis pipeline, where a diffusion model generates spatially-consistent images while a MLLM produces corresponding textual descriptions. The automated construction of SKG enables scalable generation of diverse yet realistic spatial configurations, overcoming the limitations of manual data collection and annotation. Extensive experiments demonstrate that data synthesized from diverse types of spatial knowledge, including direction and distance, enhance the spatial perception and reasoning abilities of MLLMs markedly, albeit with a slight cost to their general capabilities. We hope that the idea of knowledge-based data synthesis can advance the development of spatial intelligence. Code is available at https://github.com/zjunlp/Knowledge2Data.

large language model, machine learning, natural language, (19 more...)

arXiv.org Artificial Intelligence

2505.22633

Country:

North America > United States (1.00)
Asia > China > Zhejiang Province (0.28)

Genre: Research Report > New Finding (1.00)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Spatial Reasoning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Add feedback

MoveGPT: Scaling Mobility Foundation Models with Spatially-Aware Mixture of Experts

Han, Chonghua, Yuan, Yuan, Ding, Jingtao, Feng, Jie, Meng, Fanjin, Li, Yong

arXiv.org Artificial IntelligenceNov-25-2025

The success of foundation models in language has inspired a new wave of general-purpose models for human mobility. However, existing approaches struggle to scale effectively due to two fundamental limitations: a failure to use meaningful basic units to represent movement, and an inability to capture the vast diversity of patterns found in large-scale data. In this work, we develop MoveGPT, a large-scale foundation model specifically architected to overcome these barriers. MoveGPT is built upon two key innovations: (1) a unified location encoder that maps geographically disjoint locations into a shared semantic space, enabling pre-training on a global scale; and (2) a Spatially-Aware Mixture-of-Experts Transformer that develops specialized experts to efficiently capture diverse mobility patterns. Pre-trained on billion-scale datasets, MoveGPT establishes a new state-of-the-art across a wide range of downstream tasks, achieving performance gains of up to 35% on average. It also demonstrates strong generalization capabilities to unseen cities. Crucially, our work provides empirical evidence of scaling ability in human mobility, validating a clear path toward building increasingly capable foundation models in this domain.

large language model, machine learning, trajectory, (20 more...)

arXiv.org Artificial Intelligence

2505.1867

Country: North America > United States > California (0.14)

Genre: Research Report > New Finding (0.46)

Industry:

Transportation (0.67)
Consumer Products & Services > Travel (0.46)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
Information Technology > Data Science (0.95)
Information Technology > Artificial Intelligence > Representation & Reasoning > Spatial Reasoning (0.68)

Add feedback

IndustryNav: Exploring Spatial Reasoning of Embodied Agents in Dynamic Industrial Navigation

Li, Yifan, Li, Lichi, Dao, Anh, Zhou, Xinyu, Qiao, Yicheng, Mai, Zheda, Lee, Daeun, Chen, Zichen, Tan, Zhen, Bansal, Mohit, Kong, Yu

arXiv.org Artificial IntelligenceNov-24-2025

While Visual Large Language Models (VLLMs) show great promise as embodied agents, they continue to face substantial challenges in spatial reasoning. Existing embodied benchmarks largely focus on passive, static household environments and evaluate only isolated capabilities, failing to capture holistic performance in dynamic, real-world complexity. To fill this gap, we present IndustryNav, the first dynamic industrial navigation benchmark for active spatial reasoning. IndustryNav leverages 12 manually created, high-fidelity Unity warehouse scenarios featuring dynamic objects and human movement. Our evaluation employs a PointGoal navigation pipeline that effectively combines egocentric vision with global odometry to assess holistic local-global planning. Crucially, we introduce the "collision rate" and "warning rate" metrics to measure safety-oriented behaviors and distance estimation. A comprehensive study of nine state-of-the-art VLLMs (including models such as GPT-5-mini, Claude-4.5, and Gemini-2.5) reveals that closed-source models maintain a consistent advantage; however, all agents exhibit notable deficiencies in robust path planning, collision avoidance and active exploration. This highlights a critical need for embodied research to move beyond passive perception and toward tasks that demand stable planning, active exploration, and safe behavior in dynamic, real-world environment.

large language model, machine learning, natural language, (19 more...)

arXiv.org Artificial Intelligence

2511.17384

Country:

North America > United States > California (0.46)
North America > United States > Michigan (0.28)

Genre: Research Report (0.64)

Industry: Transportation (0.88)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Spatial Reasoning (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning > Agents (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Add feedback

WorldGen: From Text to Traversable and Interactive 3D Worlds

Wang, Dilin, Jung, Hyunyoung, Monnier, Tom, Sohn, Kihyuk, Zou, Chuhang, Xiang, Xiaoyu, Yeh, Yu-Ying, Liu, Di, Huang, Zixuan, Nguyen-Phuoc, Thu, Fan, Yuchen, Oprea, Sergiu, Wang, Ziyan, Shapovalov, Roman, Sarafianos, Nikolaos, Groueix, Thibault, Toisoul, Antoine, Dhar, Prithviraj, Chu, Xiao, Chen, Minghao, Park, Geon Yeong, Gupta, Mahima, Azziz, Yassir, Ranjan, Rakesh, Vedaldi, Andrea

arXiv.org Artificial IntelligenceNov-24-2025

We introduce WorldGen, a system that enables the automatic creation of large-scale, interactive 3D worlds directly from text prompts. Our approach transforms natural language descriptions into traversable, fully textured environments that can be immediately explored or edited within standard game engines. By combining LLM-driven scene layout reasoning, procedural generation, diffusion-based 3D generation, and object-aware scene decomposition, WorldGen bridges the gap between creative intent and functional virtual spaces, allowing creators to design coherent, navigable worlds without manual modeling or specialized 3D expertise. The system is fully modular and supports fine-grained control over layout, scale, and style, producing worlds that are geometrically consistent, visually rich, and efficient to render in real time. This work represents a step towards accessible, generative world-building at scale, advancing the frontier of 3D generative AI for applications in gaming, simulation, and immersive social environments.

machine learning, natural language, proc, (22 more...)

arXiv.org Artificial Intelligence

2511.16825

Country:

Asia (0.45)
North America > United States (0.28)

Genre: Research Report (0.82)

Industry: Leisure & Entertainment > Games > Computer Games (0.48)

Technology:

Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning > Spatial Reasoning (1.00)
Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.48)

Add feedback

Searching in Space and Time: Unified Memory-Action Loops for Open-World Object Retrieval

Chen, Taijing, Kumar, Sateesh, Xu, Junhong, Pavlakos, Georgios, Biswas, Joydeep, Martín-Martín, Roberto

arXiv.org Artificial IntelligenceNov-24-2025

Service robots must retrieve objects in dynamic, open-world settings where requests may reference attributes ("the red mug"), spatial context ("the mug on the table"), or past states ("the mug that was here yesterday"). Existing approaches capture only parts of this problem: scene graphs capture spatial relations but ignore temporal grounding, temporal reasoning methods model dynamics but do not support embodied interaction, and dynamic scene graphs handle both but remain closed-world with fixed vocabularies. We present STAR (SpatioTemporal Active Retrieval), a framework that unifies memory queries and embodied actions within a single decision loop. STAR leverages non-parametric long-term memory and a working memory to support efficient recall, and uses a vision-language model to select either temporal or spatial actions at each step. We introduce STARBench, a benchmark of spatiotemporal object search tasks across simulated and real environments. Experiments in STARBench and on a Tiago robot show that STAR consistently outperforms scene-graph and memory-only baselines, demonstrating the benefits of treating search in time and search in space as a unified problem. For more information: https://amrl.cs.utexas.edu/STAR.

artificial intelligence, machine learning, natural language, (17 more...)

arXiv.org Artificial Intelligence

2511.14004

Country: North America > United States > Texas (0.28)

Genre: Research Report (0.82)

Technology:

Information Technology > Artificial Intelligence > Robots (1.00)
Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Machine Learning (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning > Spatial Reasoning (0.88)

Add feedback

Spatial Reasoning in Multimodal Large Language Models: A Survey of Tasks, Benchmarks and Methods

Liu, Weichen, Xue, Qiyao, Wang, Haoming, Yin, Xiangyu, Yang, Boyuan, Gao, Wei

arXiv.org Artificial IntelligenceNov-21-2025

Spatial reasoning, which requires ability to perceive and manipulate spatial relationships in the 3D world, is a fundamental aspect of human intelligence, yet remains a persistent challenge for Multimodal large language models (MLLMs). While existing surveys often categorize recent progress based on input modality (e.g., text, image, video, or 3D), we argue that spatial ability is not solely determined by the input format. Instead, our survey introduces a taxonomy that organizes spatial intelligence from cognitive aspect and divides tasks in terms of reasoning complexity, linking them to several cognitive functions. We map existing benchmarks across text-only, vision-language, and embodied settings onto this taxonomy, and review evaluation metrics and methodologies for assessing spatial reasoning ability. This cognitive perspective enables more principled cross-task comparisons and reveals critical gaps between current model capabilities and human-like reasoning. In addition, we analyze methods for improving spatial ability, spanning both training-based and reasoning-based approaches. This dual-perspective analysis clarifies their respective strengths, uncovers complementary mechanisms. By surveying tasks, benchmarks, and recent advances, we aim to provide new researchers with a comprehensive understanding of the field and actionable directions for future research.

large language model, machine learning, natural language, (18 more...)

arXiv.org Artificial Intelligence

2511.15722

Country:

North America (0.28)
Asia (0.28)

Genre: Overview (1.00)

Industry: Health & Medicine > Therapeutic Area > Neurology (0.66)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Spatial Reasoning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
Information Technology > Artificial Intelligence > Cognitive Science (1.00)

Add feedback

STAMP: Spatial-Temporal Adapter with Multi-Head Pooling

Shook, Brad, Turner, Abby, Chen, Jieshi, Wiliński, Michał, Goswami, Mononito, Elmer, Jonathan, Dubrawski, Artur

arXiv.org Artificial IntelligenceNov-21-2025

Time series foundation models (TSFMs) pretrained on data from multiple domains have shown strong performance on diverse modeling tasks. Various efforts have been made to develop foundation models specific to electroencephalography (EEG) data, which records brain electrical activity as time series. However, no comparative analysis of EEG-specific foundation models (EEGFMs) versus general TSFMs has been performed on EEG-specific tasks. We introduce a novel Spatial-Temporal Adapter with Multi-Head Pooling (STAMP), which leverages univariate embeddings produced by a general TSFM, implicitly models spatial-temporal characteristics of EEG data, and achieves performance comparable to state-of-the-art EEGFMs. A comprehensive analysis is performed on 8 benchmark datasets of clinical tasks using EEG for classification, along with ablation studies. Our proposed adapter is lightweight in trainable parameters and flexible in the inputs it can accommodate, supporting easy modeling of EEG data using TSFMs.

artificial intelligence, machine learning, natural language, (20 more...)

arXiv.org Artificial Intelligence

2511.10848

Country: North America > United States > Pennsylvania > Allegheny County > Pittsburgh (0.14)

Genre: Research Report (1.00)

Industry: Health & Medicine > Therapeutic Area > Psychiatry/Psychology > Mental Health (0.46)

Technology:

Information Technology > Artificial Intelligence > Natural Language (0.93)
Information Technology > Artificial Intelligence > Representation & Reasoning > Spatial Reasoning (0.92)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.68)

Add feedback

TESSERA: Temporal Embeddings of Surface Spectra for Earth Representation and Analysis

Feng, Zhengpeng, Atzberger, Clement, Jaffer, Sadiq, Knezevic, Jovana, Sormunen, Silja, Young, Robin, Lisaius, Madeline C., Immitzer, Markus, Jackson, Toby, Ball, James, Coomes, David A., Madhavapeddy, Anil, Blake, Andrew, Keshav, Srinivasan

arXiv.org Artificial IntelligenceNov-21-2025

Satellite Earth-observation (EO) time series in the optical and microwave ranges of the electromagnetic spectrum are often irregular due to orbital patterns and cloud obstruction. Compositing addresses these issues but loses information with respect to vegetation phenology, which is critical for many downstream tasks. Instead, we present TESSERA, a pixel-wise foundation model for multi-modal (Sentinel-1/2) EO time series that learns robust, label-efficient em-beddings. During model training, TESSERA uses Barlow Twins and sparse random temporal sampling to enforce invariance to the selection of valid observations. W e employ two key regularizers: global shuffling to decorrelate spatial neighborhoods and mix-based regulation to improve invariance under extreme sparsity. W e find that for diverse classification, segmentation, and regression tasks, TESSERA embeddings deliver state-of-the-art accuracy with high label efficiency, often requiring only a small task head and minimal computation. T o democratize access, adhere to F AIR principles, and simplify use, we release global, annual, 10m, pixel-wise int8 embeddings together with open weights/code and lightweight adaptation heads, thus providing practical tooling for large-scale retrieval and inference at planetary scale. The model training/inference code, downstream task code, and pre-generated embeddings can be accessed at https://github.com/ucam-eo.

artificial intelligence, machine learning, natural language, (19 more...)

arXiv.org Artificial Intelligence

2506.2038

Country:

Europe (1.00)
North America > United States (0.28)

Genre: Research Report (0.81)

Industry:

Information Technology (0.46)
Energy (0.32)

Technology:

Information Technology > Sensing and Signal Processing > Image Processing (1.00)
Information Technology > Data Science (1.00)
Information Technology > Artificial Intelligence > Vision (1.00)
(4 more...)

Add feedback

DeepThinkVLA: Enhancing Reasoning Capability of Vision-Language-Action Models

Yin, Cheng, Lin, Yankai, Xu, Wang, Tam, Sikyuen, Zeng, Xiangrui, Liu, Zhiyuan, Yin, Zhouping

arXiv.org Artificial IntelligenceNov-20-2025

Enabling Vision-Language-Action (VLA) models to "think before acting" via Chain-of-Thought (CoT) is a promising path to overcoming the data-hungry nature of end-to-end robot policies. However, progress is stalled by a fundamental conflict: existing models use a single autoregressive decoder for both sequential CoT reasoning and high-dimensional, parallelizable robot actions. This architectural mismatch degrades motor control and fails to forge a strong causal link between thought and action. We introduce DeepThinkVLA, which resolves this conflict through a tightly integrated architecture and training strategy. Architecturally, our hybrid-attention decoder generates sequential CoT with causal attention and then switches to bidirectional attention for fast, parallel decoding of action vectors. This design is complemented by a two-stage training pipeline: we first use Supervised Fine-Tuning (SFT) to teach the model foundational reasoning, then apply Reinforcement Learning (RL) with task-success rewards to causally align the full reasoning-action sequence with desired outcomes. This synergy leads to state-of-the-art performance, achieving a 97.0% Our ablations confirm the design's effectiveness: the hybrid architecture alone outperforms standard decoders by 15.5%, and the final RL stage provides a crucial 2% boost to secure top performance. Vision-Language-Action (VLA) models have driven notable progress in robotic manipulation, enabling tasks like stacking blocks, opening drawers, and arranging household objects (Huang et al., 2023; Zitkovich et al., 2023; Y ang et al., 2024; Cadene et al., 2024). The dominant paradigm learns a reactive, end-to-end policy that directly maps high-level goals and sensory inputs to low-level motor commands (Chi et al., 2023; Kim et al., 2024; Bjorck et al., 2025).

artificial intelligence, arxiv preprint arxiv, machine learning, (16 more...)

arXiv.org Artificial Intelligence

2511.15669

Country: Asia > China (0.14)

Genre: Research Report (1.00)

Technology:

Information Technology > Artificial Intelligence > Robots (1.00)
Information Technology > Artificial Intelligence > Machine Learning (1.00)
Information Technology > Artificial Intelligence > Cognitive Science > Problem Solving (0.64)
Information Technology > Artificial Intelligence > Representation & Reasoning > Spatial Reasoning (0.46)

Add feedback