AITopics | Spatial Reasoning

Spatial Reasoning

Navigation and Exploration with Active Inference: from Biology to Industry

de Tinguy, Daria, Verbelen, Tim, Dhoedt, Bart

arXiv.org Artificial Intelligence, Oct-13-2025

By building and updating internal cognitive maps, animals exhibit extraordinary navigation abilities in complex, dynamic environments. Inspired by these biological mechanisms, we present a real-time robotic navigation system grounded in the Active Inference Framework (AIF). Our model incrementally constructs a topological map, infers the agent's location, and plans actions by minimising expected uncertainty and fulfilling perceptual goals without any prior training. Integrated into the ROS2 ecosystem, we validate its adaptability and efficiency across both 2D and 3D environments (simulated and real-world), demonstrating competitive performance with traditional and state-of-the-art exploration approaches while offering a biologically inspired navigation approach.
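As a rough illustration of the planning step this abstract describes, the toy Python sketch below picks the next node of a topological map by combining expected uncertainty reduction with a goal preference. It is not the authors' implementation: the graph, the per-node belief vectors, the `preferences` dictionary, and both function names are hypothetical, and the entropy of a node's belief is used only as a crude stand-in for the uncertainty a visit could resolve.

```python
import math


def entropy(p):
    """Shannon entropy (in nats) of a discrete distribution."""
    return -sum(x * math.log(x) for x in p if x > 0)


def choose_next_node(graph, current, beliefs, preferences=None):
    """Return the neighbour of `current` with the highest score.

    Score = entropy of the belief held about that node (a proxy for the
    uncertainty a visit could resolve) + an optional goal preference.
    """
    preferences = preferences or {}

    def score(node):
        return entropy(beliefs[node]) + preferences.get(node, 0.0)

    return max(graph[current], key=score)


# Hypothetical topological map: node A connects to B and C.
graph = {"A": ["B", "C"], "B": ["A"], "C": ["A"]}
# Hypothetical beliefs over two possible observations at each node.
beliefs = {"B": [0.5, 0.5], "C": [0.9, 0.1]}

# Pure exploration: move toward the most uncertain neighbour.
explore = choose_next_node(graph, "A", beliefs)
# Goal seeking: a strong preference for C outweighs B's uncertainty.
seek_goal = choose_next_node(graph, "A", beliefs, {"C": 1.0})
```

With these made-up numbers, the agent explores toward B when it has no goal and heads to C once C carries a preference, which mirrors the trade-off between reducing uncertainty and fulfilling perceptual goals mentioned above.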

artificial intelligence, machine learning, spatial reasoning, (16 more...)

2508.07269

Country: North America > United States > California (0.28)

Genre: Research Report (1.00)

Industry: Health & Medicine > Therapeutic Area > Neurology (0.47)

Technology:

Information Technology > Artificial Intelligence > Robots (1.00)
Information Technology > Artificial Intelligence > Cognitive Science (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models (0.46)
(2 more...)

iMoWM: Taming Interactive Multi-Modal World Model for Robotic Manipulation

Zhang, Chuanrui, Wu, Zhengxian, Lu, Guanxing, Tang, Yansong, Wang, Ziwei

arXiv.org Artificial Intelligence, Oct-13-2025

Learned world models hold significant potential for robotic manipulation, as they can serve as a simulator for real-world interactions. While extensive progress has been made in 2D video-based world models, these approaches often lack geometric and spatial reasoning, which is essential for capturing the physical structure of the 3D world. To address this limitation, we introduce iMoWM, a novel interactive world model designed to generate color images, depth maps, and robot arm masks in an autoregressive manner conditioned on actions. To overcome the high computational cost associated with three-dimensional information, we propose MMTokenizer, which unifies multi-modal inputs into a compact token representation. This design enables iMoWM to leverage large-scale pretrained VideoGPT models while maintaining high efficiency and incorporating richer physical information. With its multi-modal representation, iMoWM not only improves the visual quality of future predictions but also serves as an effective simulator for model-based reinforcement learning (MBRL) and facilitates real-world imitation learning. Extensive experiments demonstrate the superiority of iMoWM across these tasks, showcasing the advantages of multi-modal world modeling for robotic manipulation. Homepage: https://xingyoujun.github.io/imowm/

arxiv preprint arxiv, large language model, machine learning, (17 more...)

2510.09036

Country: Asia (0.28)

Genre: Research Report > New Finding (0.94)

Technology:

Information Technology > Artificial Intelligence > Robots (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Cognitive Science > Problem Solving (1.00)
(2 more...)

BEAR: Benchmarking and Enhancing Multimodal Language Models for Atomic Embodied Capabilities

Qi, Yu, Zhao, Haibo, Guo, Ziyu, Ma, Siyuan, Chen, Ziyan, Han, Yaokun, Zhang, Renrui, Lin, Zitiantao, Xin, Shiji, Huang, Yijian, Cheng, Kai, Wang, Peiheng, Liu, Jiazheng, Zhang, Jiayi, Zhu, Yizhe, Wang, Wenqing, Qin, Yiran, Zhu, Xupeng, Huang, Haojie, Wong, Lawson L. S.

arXiv.org Artificial Intelligence, Oct-13-2025

Embodied capabilities refer to a suite of fundamental abilities for an agent to perceive, comprehend, and interact with the physical world. While multimodal large language models (MLLMs) show promise as embodied agents, a thorough and systematic evaluation of their embodied capabilities remains underexplored, as existing benchmarks primarily focus on specific domains such as planning or spatial understanding. To bridge this gap, we introduce BEAR, a comprehensive and fine-grained benchmark that evaluates MLLMs on atomic embodied capabilities. BEAR comprises 4,469 interleaved image-video-text entries across 14 domains in 6 categories, with tasks ranging from low-level pointing, trajectory understanding, and spatial reasoning to high-level planning. Extensive evaluation results of 20 representative MLLMs reveal their persistent limitations across all domains of embodied capabilities. To tackle the shortfall, we propose BEAR-Agent, a multimodal conversable agent that integrates pretrained vision models to strengthen MLLM perception, 3D understanding, and planning capabilities. It substantially enhances MLLM performance across diverse embodied capabilities on BEAR, yielding a 9.12% absolute gain and a relative improvement of 17.5% on GPT-5. Furthermore, our experiments indicate that improving MLLM embodied capabilities can benefit embodied tasks in simulated environments. Project website: https://bear-official66.github.io/
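The two gain figures quoted for GPT-5 can be related with a short calculation. The sketch below assumes that both numbers describe the same overall BEAR score and that the relative improvement is defined as the absolute gain divided by the baseline score; neither assumption is stated in the abstract, and the function name is hypothetical.

```python
def implied_baseline(absolute_gain, relative_gain):
    """Baseline score implied by an absolute gain and a relative gain.

    Assumes relative_gain = absolute_gain / baseline.
    """
    return absolute_gain / relative_gain


# Figures from the abstract: 9.12 points absolute, 17.5% relative.
baseline = implied_baseline(9.12, 0.175)
improved = baseline + 9.12

print(f"implied baseline: {baseline:.1f}, with agent: {improved:.1f}")
```

Under those assumptions the figures would correspond to a baseline of roughly 52.1% rising to roughly 61.2%. These values are inferred here, not reported in the abstract.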

category, large language model, machine learning, (18 more...)

2510.08759

Genre: Research Report > New Finding (0.92)

Industry:

Information Technology (0.45)
Appliances & Durable Goods (0.45)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Spatial Reasoning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Natural Language > Chatbot (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

SpatialPIN: Enhancing Spatial Reasoning Capabilities

Neural Information Processing Systems, Oct-11-2025, 00:28:19 GMT

To this end, we propose SpatialPIN, a framework that utilizes progressive prompting and interactions between VLMs and 2D/3D foundation models as a "free lunch" to enhance spatial reasoning capabilities.

axis, dataset, vlm, (11 more...)

Country:

Europe > United Kingdom > England > Oxfordshire > Oxford (0.04)
Europe > Netherlands > North Holland > Amsterdam (0.04)
Asia > Middle East > Republic of Türkiye > Karaman Province > Karaman (0.04)

Genre: Research Report > Experimental Study (0.93)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Spatial Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Performance Analysis > Accuracy (0.46)

SpatialRGPT: Grounded Spatial Reasoning in Vision-Language Models

Neural Information Processing Systems, Oct-10-2025, 21:26:43 GMT

Vision Language Models (VLMs) have demonstrated remarkable performance in 2D vision and language tasks. However, their ability to reason about spatial arrangements remains limited. In this work, we introduce Spatial Region GPT (SpatialRGPT) to enhance VLMs' spatial perception and reasoning capabilities.

arxiv preprint arxiv, dataset, spatialrgpt, (14 more...)

Country:

South America > Brazil (0.04)
North America > United States > California > San Diego County > San Diego (0.04)
Europe > France > Bourgogne-Franche-Comté > Doubs > Besançon (0.04)

Genre:

Research Report > New Finding (0.93)
Research Report > Experimental Study (0.93)

Industry: Information Technology (0.46)

Technology:

Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.93)
Information Technology > Artificial Intelligence > Representation & Reasoning > Spatial Reasoning (0.84)

Synergistic Dual Spatial-aware Generation of Image-to-Text and Text-to-Image

Zhao, Yu

Neural Information Processing Systems, Oct-10-2025, 18:53:08 GMT

In the visual spatial understanding (VSU) area, spatial image-to-text (SI2T) and spatial text-to-image (ST2I) are two fundamental tasks that appear in dual form. Existing methods for standalone SI2T or ST2I perform imperfectly in spatial understanding, due to the difficulty of 3D-wise spatial feature modeling.

diffusion model, proceedings, representation, (10 more...)

Country:

North America > United States > Louisiana > Orleans Parish > New Orleans (0.04)
Europe > Italy > Tuscany > Florence (0.04)
Europe > Ireland > Leinster > County Dublin > Dublin (0.04)
(22 more...)

Genre: Research Report > Experimental Study (1.00)

Technology:

Information Technology > Sensing and Signal Processing > Image Processing (1.00)
Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning > Spatial Reasoning (1.00)
(2 more...)

Learning from Highly Sparse Spatio-temporal Data

Neural Information Processing Systems, Oct-10-2025, 12:47:42 GMT

Incomplete spatio-temporal data in the real world has spawned much research.

dataset, information, st point, (17 more...)

Country:

Europe > Spain > Galicia > Madrid (0.04)
Asia > China (0.04)
Pacific Ocean > North Pacific Ocean > San Francisco Bay (0.04)
North America > United States > California > San Francisco County > San Francisco (0.04)

Genre: Research Report > Experimental Study (0.93)

Industry: Energy (0.46)

Technology:

Information Technology > Data Science > Data Mining (0.92)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.68)
Information Technology > Artificial Intelligence > Representation & Reasoning > Spatial Reasoning (0.67)

CYCLO: Cyclic Graph Transformer Approach to Multi-Object Relationship Modeling in Aerial Videos

Neural Information Processing Systems, Oct-10-2025, 12:08:52 GMT

In this paper, we introduce the new Aero-Eye dataset that focuses on multi-object relationship modeling in aerial videos.

dataset, proceedings, video, (10 more...)

Country:

Europe > Switzerland > Zürich > Zürich (0.14)
North America > United States > Arkansas (0.04)
Asia > Myanmar > Tanintharyi Region > Dawei (0.04)
(2 more...)

Genre: Research Report > Experimental Study (0.93)

Industry:

Leisure & Entertainment > Sports (0.92)
Transportation > Ground > Road (0.46)

Technology:

Information Technology > Data Science (1.00)
Information Technology > Communications (1.00)
Information Technology > Artificial Intelligence > Vision (1.00)
(3 more...)

Supplementary Material: TorchSpatial: A Location Encoding Framework and Benchmark for Spatial Representation Learning

Neural Information Processing Systems, Oct-10-2025, 10:08:05 GMT

Author ordering is determined by coin flip.
For what purpose was the dataset created? Was there a specific task in mind? In order to systematically compare the location encoders' performance and their impact on the [...]
Who created the dataset (e.g., which team, research group) and on behalf of which entity (e.g., [...]
Who funded the creation of the dataset? Dr. Gengchen Mai acknowledges the Microsoft Research [...]
What do the instances that comprise the dataset represent (e.g., documents, photos, people, [...] The instances in all 17 datasets represent images.

dataset, please describe, please provide, (16 more...)

Country: