SpatialBench: Benchmarking Multimodal Large Language Models for Spatial Cognition
Xu, Peiran, Wang, Sudong, Zhu, Yao, Li, Jianing, Zhang, Yunjian
Spatial cognition is fundamental to real-world multimodal intelligence, allowing models to effectively interact with the physical environment. While multimodal large language models (MLLMs) have made significant strides, existing benchmarks often oversimplify spatial cognition, reducing it to a single-dimensional metric that fails to capture the hierarchical structure and interdependence of spatial abilities. To address this gap, we propose a hierarchical spatial cognition framework that decomposes spatial intelligence into five progressively complex levels, from basic observation to high-level planning. Building upon this taxonomy, we construct SpatialBench, a large-scale, fine-grained benchmark covering 15 tasks aligned with these cognitive levels. To provide a unified evaluation across heterogeneous tasks, we further introduce a high-level, capability-oriented metric that reliably assesses a model's overall spatial reasoning ability. Extensive experiments across a broad range of MLLMs reveal distinct performance stratification across cognitive levels: models exhibit strong perceptual grounding yet remain limited in symbolic reasoning, causal inference, and planning. Additional human tests demonstrate that humans perform selective, goal-directed abstraction, while MLLMs tend to over-attend to surface details without coherent spatial intent. Our work establishes the first systematic framework for measuring hierarchical spatial cognition in MLLMs, laying the foundation for future spatially intelligent systems.
- Information Technology > Artificial Intelligence > Representation & Reasoning > Spatial Reasoning (1.00)
- Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
- Information Technology > Artificial Intelligence > Cognitive Science (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.47)
Solving Spatial Supersensing Without Spatial Supersensing
Udandarao, Vishaal, Karthik, Shyamgopal, Nath, Surabhi S., Hochlehnert, Andreas, Bethge, Matthias, Prabhu, Ameya
Cambrian-S aims to take the first steps towards improving video world models with spatial supersensing by introducing (i) two benchmarks, VSI-Super-Recall (VSR) and VSI-Super-Counting (VSC), and (ii) bespoke predictive sensing inference strategies tailored to each benchmark. In this work, we conduct a critical analysis of Cambrian-S across both these fronts. First, we introduce a simple baseline, NoSense, which discards almost all temporal structure and uses only a bag-of-words SigLIP model, yet near-perfectly solves VSR, achieving 95% accuracy even on 4-hour videos. This shows that benchmarks like VSR can be nearly solved without spatial cognition, world modeling, or spatial supersensing. Second, we hypothesize that the tailored inference methods proposed by Cambrian-S likely exploit shortcut heuristics in the benchmark. We illustrate this with a simple sanity check on the VSC benchmark, called VSC-Repeat: we concatenate each video with itself 1-5 times, which does not change the number of unique objects. However, this simple perturbation entirely collapses the mean relative accuracy of Cambrian-S from 42% to 0%. A system that performs spatial supersensing and integrates information across experiences should recognize views of the same scene and keep object-count predictions unchanged; instead, the Cambrian-S inference algorithm largely relies on a shortcut in the VSC benchmark: rooms are never revisited. Taken together, our findings suggest that (i) current VSI-Super benchmarks do not yet reliably measure spatial supersensing, and (ii) the predictive-sensing inference recipes used by Cambrian-S improve performance by inadvertently exploiting shortcuts rather than through robust spatial supersensing. We include the response from the Cambrian-S authors (in Appendix A) to provide a balanced perspective alongside our claims. We release our code at: https://github.com/bethgelab/supersanity
- North America > United States (0.04)
- Asia > Middle East > Jordan (0.04)
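The VSC-Repeat perturbation described above has a simple invariance at its core: concatenating a video with itself cannot change the number of unique objects. A minimal sketch of that property, assuming per-frame object detections represented as plain Python sets (an illustrative stand-in, not the Cambrian-S pipeline):

```python
def count_unique_objects(frames):
    """Count distinct objects across a video.

    frames: list of per-frame sets of object identifiers.
    """
    seen = set()
    for detections in frames:
        seen |= detections
    return len(seen)

def vsc_repeat(frames, k):
    """VSC-Repeat perturbation: append k extra copies of the video (k in 1..5)."""
    return frames * (k + 1)

# A 3-frame toy video containing 4 unique objects.
video = [{"chair", "table"}, {"chair", "lamp"}, {"sofa"}]

# Any counter that truly integrates information across views must be
# invariant under repetition; Cambrian-S reportedly is not.
base = count_unique_objects(video)
for k in range(1, 6):
    assert count_unique_objects(vsc_repeat(video, k)) == base
```

A system that instead increments its count every time it "enters a new room" would scale its answer with k, which is consistent with the collapse from 42% to 0% reported above.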
RynnEC: Bringing MLLMs into Embodied World
Dang, Ronghao, Yuan, Yuqian, Mao, Yunxuan, Li, Kehan, Liu, Jiangpin, Wang, Zhikai, Li, Xin, Wang, Fan, Zhao, Deli
We introduce RynnEC, a video multimodal large language model designed for embodied cognition. Built upon a general-purpose vision-language foundation model, RynnEC incorporates a region encoder and a mask decoder, enabling flexible region-level video interaction. Despite its compact architecture, RynnEC achieves state-of-the-art performance in object property understanding, object segmentation, and spatial reasoning. Conceptually, it offers a region-centric video paradigm for the brain of embodied agents, providing fine-grained perception of the physical world and enabling more precise interactions. To mitigate the scarcity of annotated 3D datasets, we propose an egocentric-video-based pipeline for generating embodied cognition data. Furthermore, we introduce RynnEC-Bench, a region-centered benchmark for evaluating embodied cognitive capabilities. We anticipate that RynnEC will advance the development of general-purpose cognitive cores for embodied agents and facilitate generalization across diverse embodied tasks. The code, model checkpoints, and benchmark are available at: https://github.com/alibaba-damo-academy/RynnEC
A Preliminary Exploration of the Differences and Conjunction of Traditional PNT and Brain-inspired PNT
He, Xu, Meng, Xiaolin, Yin, Wenxuan, Zhang, Youdong, Mo, Lingfei, An, Xiangdong, Yu, Fangwen, Pan, Shuguo, Liu, Yufeng, Liu, Jingnan, Zhang, Yujia, Gao, Wang
Developing universal Positioning, Navigation, and Timing (PNT) is our enduring goal. Today's complex environments demand PNT that is more resilient, energy-efficient, and cognitively capable. This paper asks how we can endow unmanned systems with brain-inspired spatial-cognition navigation while exploiting the high precision of machine PNT to advance universal PNT. We provide a new perspective and roadmap for shifting PNT from "tool-oriented" to "cognition-driven". Contributions: (1) a multi-level dissection of the differences among traditional PNT, biological brain PNT, and brain-inspired PNT; (2) a four-layer (observation-capability-decision-hardware) fusion framework that unites numerical precision and brain-inspired intelligence; (3) forward-looking recommendations for the future development of brain-inspired PNT. Keywords: Brain-inspired navigation, PNT, Differences and Conjunction, Fusion Framework. 1. Introduction: Unmanned-system Positioning, Navigation, and Timing (PNT) technologies have achieved numerous practical advances. Particularly noteworthy is the rapid maturation of Global Navigation Satellite System (GNSS)-based PNT, which has not only expanded its application domains but also driven down operational costs. However, these technologies still face formidable challenges in highly uncertain and complex scenarios, such as deep space, the deep ocean, polar regions, and dense urban environments.
- North America > United States (0.28)
- Asia > China > Hubei Province > Wuhan (0.04)
- North America > Canada > British Columbia > Vancouver (0.04)
- (2 more...)
- Health & Medicine > Therapeutic Area > Neurology (1.00)
- Information Technology > Security & Privacy (0.93)
11Plus-Bench: Demystifying Multimodal LLM Spatial Reasoning with Cognitive-Inspired Analysis
Li, Chengzu, Wu, Wenshan, Zhang, Huanyu, Li, Qingtao, Gao, Zeyu, Xia, Yan, Hernández-Orallo, José, Vulić, Ivan, Wei, Furu
In human cognition, spatial reasoning and perception are closely entangled, yet the nature of this interplay remains underexplored in the evaluation of multimodal large language models (MLLMs). While recent MLLM advancements show impressive performance on reasoning, their capacity for human-like spatial cognition remains an open question. In this work, we introduce a systematic evaluation framework to assess the spatial reasoning abilities of state-of-the-art MLLMs relative to human performance. Central to our work is 11Plus-Bench, a high-quality benchmark derived from realistic standardized spatial aptitude tests. 11Plus-Bench also features fine-grained expert annotations of both perceptual complexity and reasoning process, enabling detailed instance-level analysis of model behavior. Through extensive experiments across 14 MLLMs and human evaluation, we find that current MLLMs exhibit early signs of spatial cognition. Despite a large performance gap compared to humans, MLLMs' cognitive profiles resemble those of humans in that cognitive effort correlates strongly with reasoning-related complexity. However, instance-level performance in MLLMs remains largely random, whereas human correctness is highly predictable and shaped by abstract pattern complexity. These findings highlight both the emerging capabilities and the limitations of current MLLMs' spatial reasoning and provide actionable insights for advancing model design.
- Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.28)
- Europe > Austria > Vienna (0.14)
- North America > United States > New Jersey > Mercer County > Princeton (0.04)
- (3 more...)
- Research Report > Experimental Study (1.00)
- Research Report > New Finding (0.93)
- Education (1.00)
- Health & Medicine > Therapeutic Area > Neurology (0.68)
- Information Technology > Artificial Intelligence > Representation & Reasoning > Spatial Reasoning (1.00)
- Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
- Information Technology > Artificial Intelligence > Cognitive Science > Problem Solving (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.71)
Mimicking associative learning of rats via a neuromorphic robot in open field maze using spatial cell models
Liu, Tianze, Siddique, Md Abu Bakr, An, Hongyu
Data-driven Artificial Intelligence (AI) approaches have exhibited remarkable prowess across various cognitive tasks using extensive training data. However, the reliance on large datasets and neural networks presents challenges such as high power consumption and limited adaptability, particularly in SWaP-constrained (size, weight, and power) applications like planetary exploration. To address these issues, we propose enhancing the autonomous capabilities of intelligent robots by emulating the associative learning observed in animals. Associative learning enables animals to adapt to their environment by memorizing concurrent events. By replicating this mechanism, neuromorphic robots can navigate dynamic environments autonomously, learning from interactions to optimize performance. This paper explores the emulation of associative learning in rodents using neuromorphic robots within open-field maze environments, leveraging insights from spatial cells such as place and grid cells. By integrating these models, we aim to enable online associative learning for spatial tasks in real-time scenarios, bridging the gap between biological spatial cognition and robotics for advancements in autonomous systems.
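The "memorizing concurrent events" mechanism described above can be illustrated with a minimal Hebbian-style sketch. This is a toy, not the paper's spatial-cell models: the unit counts, learning rate, and pairing loop are all illustrative assumptions, showing only the core idea that connections between co-active units strengthen until one stimulus alone evokes the associated response.

```python
import numpy as np

def hebbian_update(W, pre, post, lr=0.1):
    """Hebbian rule: strengthen weights between co-active pre/post units."""
    return W + lr * np.outer(post, pre)

# 3 presynaptic "stimulus" units, 3 postsynaptic "response" units.
W = np.zeros((3, 3))

# Repeatedly pair stimulus unit 0 with response unit 1 (a concurrent event).
pre = np.array([1.0, 0.0, 0.0])
post = np.array([0.0, 1.0, 0.0])
for _ in range(10):
    W = hebbian_update(W, pre, post)

# After pairing, presenting stimulus 0 alone drives response unit 1 hardest.
response = W @ pre
assert response.argmax() == 1
```

In the neuromorphic setting the analogous association would be learned online from spiking co-activity of place- and grid-cell populations rather than from dense vectors, but the weight-strengthening principle is the same.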
Can LLMs Learn to Map the World from Local Descriptions?
Xia, Sirui, Chen, Aili, Wang, Xintao, Zhu, Tinghui, Zhang, Yikai, Chen, Jiangjie, Xiao, Yanghua
Recent advances in Large Language Models (LLMs) have demonstrated strong capabilities in tasks such as code and mathematics. However, their potential to internalize structured spatial knowledge remains underexplored. This study investigates whether LLMs, grounded in locally relative human observations, can construct coherent global spatial cognition by integrating fragmented relational descriptions. We focus on two core aspects of spatial cognition: spatial perception, where models infer consistent global layouts from local positional relationships, and spatial navigation, where models learn road connectivity from trajectory data and plan optimal paths between unconnected locations. Experiments conducted in a simulated urban environment demonstrate that LLMs not only generalize to unseen spatial relationships between points of interest (POIs) but also exhibit latent representations aligned with real-world spatial distributions. Furthermore, LLMs can learn road connectivity from trajectory descriptions, enabling accurate path planning and dynamic spatial awareness during navigation.
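The evaluation setup described above, inducing road connectivity from trajectory descriptions and then planning paths between locations never seen connected, can be sketched as a graph problem. This is a hypothetical sketch: the trajectory format and POI names are invented, and breadth-first search stands in for whatever planner the study actually evaluates (BFS is optimal only when all road segments cost the same).

```python
from collections import defaultdict, deque

def build_graph(trajectories):
    """Induce undirected road connectivity from visited-location sequences."""
    graph = defaultdict(set)
    for traj in trajectories:
        for a, b in zip(traj, traj[1:]):
            graph[a].add(b)
            graph[b].add(a)
    return graph

def plan_path(graph, start, goal):
    """Shortest path by BFS; returns None if the POIs are disconnected."""
    queue, parent = deque([start]), {start: None}
    while queue:
        node = queue.popleft()
        if node == goal:
            path = []
            while node is not None:
                path.append(node)
                node = parent[node]
            return path[::-1]
        for nxt in graph[node]:
            if nxt not in parent:
                parent[nxt] = node
                queue.append(nxt)
    return None

# Three fragmentary trajectories; A and D never co-occur in any of them.
trajs = [["A", "B", "C"], ["C", "D"], ["B", "E"]]
g = build_graph(trajs)
print(plan_path(g, "A", "D"))  # → ['A', 'B', 'C', 'D']
```

The interesting claim in the abstract is that an LLM can perform the `plan_path` step implicitly, from textual trajectory descriptions alone, without ever materializing the graph.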
A Survey of Large Language Model-Powered Spatial Intelligence Across Scales: Advances in Embodied Agents, Smart Cities, and Earth Science
Feng, Jie, Zeng, Jinwei, Long, Qingyue, Chen, Hongyi, Zhao, Jie, Xi, Yanxin, Zhou, Zhilun, Yuan, Yuan, Wang, Shengyuan, Zeng, Qingbin, Li, Songwei, Zhang, Yunke, Lin, Yuming, Li, Tong, Ding, Jingtao, Gao, Chen, Xu, Fengli, Li, Yong
Over the past year, the development of large language models (LLMs) has brought spatial intelligence into focus, with much attention on vision-based embodied intelligence. However, spatial intelligence spans a broader range of disciplines and scales, from navigation and urban planning to remote sensing and earth science. What are the differences and connections between spatial intelligence across these fields? In this paper, we first review human spatial cognition and its implications for spatial intelligence in LLMs. We then examine spatial memory, knowledge representations, and abstract reasoning in LLMs, highlighting their roles and connections. Finally, we analyze spatial intelligence across scales -- from embodied to urban and global levels -- following a framework that progresses from spatial memory and understanding to spatial reasoning and intelligence. Through this survey, we aim to provide insights into interdisciplinary spatial intelligence research and inspire future studies.
Does Spatial Cognition Emerge in Frontier Models?
Ramakrishnan, Santhosh Kumar, Wijmans, Erik, Kraehenbuehl, Philipp, Koltun, Vladlen
Not yet. We present SPACE, a benchmark that systematically evaluates spatial cognition in frontier models. Our benchmark builds on decades of research in cognitive science. It evaluates large-scale mapping abilities that are brought to bear when an organism traverses physical environments, smaller-scale reasoning about object shapes and layouts, and cognitive infrastructure such as spatial attention and memory. For many tasks, we instantiate parallel presentations via text and images, allowing us to benchmark both large language models and large multimodal models. Results suggest that contemporary frontier models fall short of the spatial intelligence of animals, performing near chance level on a number of classic tests of animal cognition.
- North America > United States > Minnesota (0.04)
- Europe > United Kingdom > England > Oxfordshire > Oxford (0.04)
- North America > United States > Virginia (0.04)
- (4 more...)
- Health & Medicine > Therapeutic Area > Neurology (1.00)
- Leisure & Entertainment (0.93)
- Education (0.93)
Failures in Perspective-taking of Multimodal AI Systems
Leonard, Bridget, Woodard, Kristin, Murray, Scott O.
This study extends previous research on spatial representations in multimodal AI systems. Although current models demonstrate a rich understanding of spatial information from images, this information is rooted in propositional representations, which differ from the analog representations employed in human and animal spatial cognition. To further explore these limitations, we apply techniques from cognitive and developmental science to assess the perspective-taking abilities of GPT-4o. Our analysis enables a comparison between the cognitive development of the human brain and that of multimodal AI, offering guidance for future research and model development.
- Information Technology > Artificial Intelligence > Cognitive Science (1.00)
- Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.73)
- Information Technology > Artificial Intelligence > Representation & Reasoning > Spatial Reasoning (0.70)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.41)