spatial perception
Aerial Vision-Language Navigation with a Unified Framework for Spatial, Temporal and Embodied Reasoning
Xu, Huilin, Liu, Zhuoyang, Luomei, Yixiang, Xu, Feng
Aerial Vision-and-Language Navigation (VLN) aims to enable unmanned aerial vehicles (UAVs) to interpret natural language instructions and navigate complex urban environments using onboard visual observation. This task holds promise for real-world applications such as low-altitude inspection, search-and-rescue, and autonomous aerial delivery. Existing methods often rely on panoramic images, depth inputs, or odometry to support spatial reasoning and action planning. These requirements increase system cost and integration complexity, thus hindering practical deployment for lightweight UAVs. We present a unified aerial VLN framework that operates solely on egocentric monocular RGB observations and natural language instructions. The model formulates navigation as a next-token prediction problem, jointly optimizing spatial perception, trajectory reasoning, and action prediction through prompt-guided multi-task learning. Moreover, we propose a keyframe selection strategy to reduce visual redundancy by retaining semantically informative frames, along with an action merging and label reweighting mechanism that mitigates long-tailed supervision imbalance and facilitates stable multi-task co-training. Extensive experiments on the Aerial VLN benchmark validate the effectiveness of our method. Under the challenging monocular RGB-only setting, our model achieves strong results across both seen and unseen environments. It significantly outperforms existing RGB-only baselines and narrows the performance gap with state-of-the-art panoramic RGB-D counterparts. Comprehensive ablation studies further demonstrate the contribution of our task design and architectural choices.
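The label reweighting idea mentioned above can be illustrated with a generic inverse-frequency scheme. This is a minimal sketch, not the authors' mechanism: the abstract does not state the weighting formula, so the function name `inverse_frequency_weights`, the action names, and the normalization are all assumptions chosen for illustration.

```python
from collections import Counter


def inverse_frequency_weights(labels):
    """Return one weight per class, proportional to 1 / class frequency.

    Hypothetical helper: weights are normalized so that the weighted
    sample count equals the original sample count.
    """
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * cnt) for cls, cnt in counts.items()}


# Toy long-tailed action distribution (illustrative action names only).
actions = ["forward"] * 8 + ["turn_left"] * 1 + ["ascend"] * 1
weights = inverse_frequency_weights(actions)

# Rare actions receive larger weights than the dominant "forward" action.
for action, weight in sorted(weights.items()):
    print(f"{action}: {weight:.3f}")
```

In a training loop, weights of this kind would typically scale the per-token or per-class loss so that rare actions are not drowned out by frequent ones.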
SPUR: A Plug-and-Play Framework for Integrating Spatial Audio Understanding and Reasoning into Large Audio-Language Models
Sakshi, S, Lokegaonkar, Vaibhavi, Zhang, Neil, Duraiswami, Ramani, Ghosh, Sreyan, Manocha, Dinesh, Lu, Lie
Spatial perception is central to auditory intelligence, enabling accurate understanding of real-world acoustic scenes and advancing human-level perception of the world around us. While recent large audio-language models (LALMs) show strong reasoning over complex audio, most operate on monaural inputs and lack the ability to capture spatial cues such as direction, elevation, and distance. We introduce SPUR, a lightweight, plug-in approach that equips LALMs with spatial perception through minimal architectural changes. SPUR consists of: (i) a First-Order Ambisonics (FOA) encoder that maps (W, X, Y, Z) channels to rotation-aware, listener-centric spatial features, integrated into target LALMs via a multimodal adapter; and (ii) SPUR-Set, a spatial QA dataset combining open-source FOA recordings with controlled simulations, emphasizing relative direction, elevation, distance, and overlap for supervised spatial reasoning. Fine-tuning our model on the SPUR-Set consistently improves spatial QA and multi-speaker attribution while preserving general audio understanding. SPUR provides a simple recipe that transforms monaural LALMs into spatially aware models. Extensive ablations validate the effectiveness of our approach.
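To see why the (W, X, Y, Z) channels carry directional cues at all, the following sketch estimates a source direction from FOA signals using the classical pseudo-intensity vector. It is not the SPUR encoder, which is learned; the function name `foa_direction` and the synthetic plane-wave test signal are assumptions made for illustration.

```python
import math


def foa_direction(w, x, y, z):
    """Estimate (azimuth, elevation) in degrees from FOA channel samples.

    Uses the time-averaged pseudo-intensity vector [W*X, W*Y, W*Z].
    Scaling of W (which differs between FOA conventions) does not
    change the estimated direction.
    """
    n = len(w)
    ix = sum(a * b for a, b in zip(w, x)) / n
    iy = sum(a * b for a, b in zip(w, y)) / n
    iz = sum(a * b for a, b in zip(w, z)) / n
    azimuth = math.degrees(math.atan2(iy, ix))
    elevation = math.degrees(math.atan2(iz, math.hypot(ix, iy)))
    return azimuth, elevation


# Synthetic single plane wave arriving from azimuth 60 deg, elevation 20 deg.
az, el = math.radians(60.0), math.radians(20.0)
signal = [math.sin(2 * math.pi * 440 * i / 8000) for i in range(800)]
w = signal
x = [s * math.cos(az) * math.cos(el) for s in signal]
y = [s * math.sin(az) * math.cos(el) for s in signal]
z = [s * math.sin(el) for s in signal]

est_az, est_el = foa_direction(w, x, y, z)
print(round(est_az, 3), round(est_el, 3))
```

A monaural model only sees something like the W channel, so the ratios among X, Y, and Z that encode direction are unavailable to it.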
NavSpace: How Navigation Agents Follow Spatial Intelligence Instructions
Yang, Haolin, Long, Yuxing, Yu, Zhuoyuan, Yang, Zihan, Wang, Minghan, Xu, Jiapeng, Wang, Yihan, Yu, Ziyan, Cai, Wenzhe, Kang, Lei, Dong, Hao
Instruction-following navigation is a key step toward embodied intelligence. Prior benchmarks mainly focus on semantic understanding but overlook systematically evaluating navigation agents' spatial perception and reasoning capabilities. In this work, we introduce the NavSpace benchmark, which contains six task categories and 1,228 trajectory-instruction pairs designed to probe the spatial intelligence of navigation agents. On this benchmark, we comprehensively evaluate 22 navigation agents, including state-of-the-art navigation models and multimodal large language models. The evaluation results lift the veil on spatial intelligence in embodied navigation. Furthermore, we propose SNav, a new spatially intelligent navigation model. SNav outperforms existing navigation agents on NavSpace and real robot tests, establishing a strong baseline for future work.
Spatial 3D-LLM: Exploring Spatial Awareness in 3D Vision-Language Models
Wang, Xiaoyan, Li, Zeju, Xu, Yifan, Qi, Jiaxing, Yang, Zhifei, Ma, Ruifei, Liu, Xiangde, Zhang, Chao
A new era has unlocked exciting possibilities for extending Large Language Models (LLMs) to tackle 3D vision-language tasks. However, most existing 3D multimodal LLMs (MLLMs) rely on compressing holistic 3D scene information or segmenting independent objects to perform these tasks, which limits their spatial awareness due to insufficient representation of the richness inherent in 3D scenes. To overcome these limitations, we propose Spatial 3D-LLM, a 3D MLLM specifically designed to enhance spatial awareness for 3D vision-language tasks by enriching the spatial embeddings of 3D scenes. Spatial 3D-LLM integrates an LLM backbone with a progressive spatial awareness scheme that captures spatial information as the perception field expands, generating location-enriched 3D scene embeddings to serve as visual prompts. Furthermore, we introduce two novel tasks, 3D object distance measurement and 3D layout editing, and construct a 3D instruction dataset, MODEL, to evaluate the model's spatial awareness capabilities. Experimental results demonstrate that Spatial 3D-LLM achieves state-of-the-art performance across a wide range of 3D vision-language tasks, with the improvements stemming from our progressive spatial awareness scheme, which mines richer spatial information. Our code is available at https://github.com/bjshuyuan/Spatial-3D-LLM.
Can LLMs Learn to Map the World from Local Descriptions?
Xia, Sirui, Chen, Aili, Wang, Xintao, Zhu, Tinghui, Zhang, Yikai, Chen, Jiangjie, Xiao, Yanghua
Recent advances in Large Language Models (LLMs) have demonstrated strong capabilities in tasks such as code and mathematics. However, their potential to internalize structured spatial knowledge remains underexplored. This study investigates whether LLMs, grounded in locally relative human observations, can construct coherent global spatial cognition by integrating fragmented relational descriptions. We focus on two core aspects of spatial cognition: spatial perception, where models infer consistent global layouts from local positional relationships, and spatial navigation, where models learn road connectivity from trajectory data and plan optimal paths between unconnected locations. Experiments conducted in a simulated urban environment demonstrate that LLMs not only generalize to unseen spatial relationships between points of interest (POIs) but also exhibit latent representations aligned with real-world spatial distributions. Furthermore, LLMs can learn road connectivity from trajectory descriptions, enabling accurate path planning and dynamic spatial awareness during navigation.
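The navigation task described above, learning road connectivity from trajectories and then planning between locations that never co-occur in a single trajectory, has a simple symbolic counterpart that makes the target behavior concrete. The sketch below is that classical baseline (graph construction plus breadth-first search), not the LLM-based method studied in the paper; the place names and helper names are hypothetical, and "optimal" is taken here to mean fewest road segments.

```python
from collections import defaultdict, deque


def build_road_graph(trajectories):
    """Build an undirected connectivity graph from ordered location lists."""
    graph = defaultdict(set)
    for trajectory in trajectories:
        for a, b in zip(trajectory, trajectory[1:]):
            graph[a].add(b)
            graph[b].add(a)
    return graph


def shortest_path(graph, start, goal):
    """Breadth-first search; returns a fewest-hops path or None."""
    parent = {start: None}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if node == goal:
            path = []
            while node is not None:
                path.append(node)
                node = parent[node]
            return path[::-1]
        for neighbor in sorted(graph.get(node, ())):
            if neighbor not in parent:
                parent[neighbor] = node
                queue.append(neighbor)
    return None


# Three observed trajectories; "cafe" and "museum" never appear together.
trajectories = [
    ["cafe", "library", "park"],
    ["park", "museum"],
    ["library", "station"],
]
graph = build_road_graph(trajectories)
route = shortest_path(graph, "cafe", "museum")
print(route)  # ['cafe', 'library', 'park', 'museum']
```

The planned route stitches together segments from two different trajectories, which is the kind of composition the paper asks whether LLMs can learn implicitly.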
Approach to Visual Attractiveness of Event Space Through Data-Driven Environment and Spatial Perception
Majiid, Aliffi, Mian, Riaz-Ul-Haque, Kurohara, Kouki, Nguyen-Tran, Yen-Khang
Revitalizing Japan's remote areas has become a crucial task, and Matsue City exemplifies this effort in its temporary event spaces, created through collective efforts to foster urban vibrancy and bring together residents and visitors. This research examines the relationship between data-driven insights using generative AI and visual attractiveness by evaluating temporary events in Matsue City, particularly considering the cognitive-cultural differences in processing visual information of the participants. The first phase employs semantic keyword extraction from interviews, categorizing responses into physical elements, activities, and atmosphere. The second phase analyzes spatial perception through three categories: layout hierarchy, product visibility, and visual attention. The correlation indicates that successful event design requires a balance between spatial efficiency and diverse needs, with a spatial organization that optimizes visitor flow and visibility strategies considering cultural and demographic diversity. These findings contribute to understanding the urban quality of temporary event spaces and offer a replicable framework for enhancing the visual appeal of events in remote areas throughout Japan.
ScratchEval: Are GPT-4o Smarter than My Child? Evaluating Large Multimodal Models with Visual Programming Challenges
Fu, Rao, Luo, Ziyang, Lin, Hongzhan, Ye, Zhen, Ma, Jing
Recent advancements in large multimodal models (LMMs) have showcased impressive code generation capabilities, primarily evaluated through image-to-code benchmarks. However, these benchmarks are limited to specific visual programming scenarios where the logical reasoning and the multimodal understanding capacities are split apart. To fill this gap, we propose ScratchEval, a novel benchmark designed to evaluate the visual programming reasoning ability of LMMs. ScratchEval is based on Scratch, a block-based visual programming language widely used in children's programming education. By integrating visual elements and embedded programming logic, ScratchEval requires the model to process both visual information and code structure, thereby comprehensively evaluating its programming intent understanding ability. Our evaluation approach goes beyond the traditional image-to-code mapping and focuses on unified logical thinking and problem-solving abilities, providing a more comprehensive and challenging framework for evaluating the visual programming ability of LMMs. ScratchEval not only fills the gap in existing evaluation methods, but also provides new insights for the future development of LMMs in the field of visual programming. Our benchmark can be accessed at https://github.com/HKBUNLP/ScratchEval.
Bayesian Heuristics for Robust Spatial Perception
Chughtai, Aamir Hussain, Tahir, Muhammad, Uppal, Momin
Spatial perception is a key task in several machine intelligence applications such as robotics and computer vision. In general, it involves the nonlinear estimation of hidden variables that represent the system's state. However, in the presence of measurement outliers, the standard nonlinear least squares formulation results in poor estimates. Several methods have been considered in the literature to improve the reliability of the estimation process. Most methods are based on heuristics since guaranteed global robust estimation is not generally practical due to high computational costs. Recently, general-purpose robust estimation heuristics have been proposed that leverage existing non-minimal solvers available for the outlier-free formulations without the need for an initial guess. In this work, we propose three Bayesian heuristics that have similar structures. We evaluate these heuristics in practical scenarios to demonstrate their merits in different applications including 3D point cloud registration, mesh registration and pose graph optimization. The general computational advantages our proposals offer make them attractive candidates for spatial perception tasks.
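The effect of outliers on least squares, and the general shape of a reweighting heuristic that reuses an outlier-free solver, can be shown on a toy line-fitting problem. The sketch below uses classical iteratively reweighted least squares with Huber weights; it is not one of the three Bayesian heuristics proposed in the paper, and the function names, threshold `k`, and data are assumptions for illustration.

```python
def weighted_line_fit(xs, ys, ws):
    """Closed-form weighted least squares for y = slope * x + intercept."""
    sw = sum(ws)
    swx = sum(w * x for w, x in zip(ws, xs))
    swy = sum(w * y for w, y in zip(ws, ys))
    swxx = sum(w * x * x for w, x in zip(ws, xs))
    swxy = sum(w * x * y for w, x, y in zip(ws, xs, ys))
    det = sw * swxx - swx * swx
    slope = (sw * swxy - swx * swy) / det
    intercept = (swy - slope * swx) / sw
    return slope, intercept


def irls_huber_line_fit(xs, ys, k=1.0, iterations=30):
    """Alternate between the outlier-free solver and Huber reweighting."""
    ws = [1.0] * len(xs)
    for _ in range(iterations):
        slope, intercept = weighted_line_fit(xs, ys, ws)
        residuals = [y - (slope * x + intercept) for x, y in zip(xs, ys)]
        # Huber weights: full weight for small residuals, k/|r| otherwise.
        ws = [1.0 if abs(r) <= k else k / abs(r) for r in residuals]
    return weighted_line_fit(xs, ys, ws)


# Ground truth y = 2x + 1, with one gross outlier at x = 9.
xs = [float(i) for i in range(10)]
ys = [2.0 * x + 1.0 for x in xs]
ys[9] = 100.0

ols_slope, ols_intercept = weighted_line_fit(xs, ys, [1.0] * len(xs))
robust_slope, robust_intercept = irls_huber_line_fit(xs, ys)
print("plain least squares slope:", round(ols_slope, 3))
print("reweighted slope:", round(robust_slope, 3))
```

A single corrupted measurement pulls the plain least squares slope far from 2, while the reweighted estimate stays close to it. Because the Huber loss only bounds an outlier's influence rather than removing it, a small residual bias remains.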
Giving robots human-like perception of their physical environments
To carry out such high-level tasks, researchers believe robots will have to be able to perceive their physical environment as humans do. "In order to make any decision in the world, you need to have a mental model of the environment around you," says Luca Carlone, assistant professor of aeronautics and astronautics at MIT. "This is something so effortless for humans. But for robots it's a painfully hard problem, where it's about transforming pixel values that they see through a camera, into an understanding of the world." Now Carlone and his students have developed a representation of spatial perception for robots that is modeled after the way humans perceive and navigate the world. The new model, which they call 3D Dynamic Scene Graphs, enables a robot to quickly generate a 3D map of its surroundings that also includes objects and their semantic labels (a chair versus a table, for instance), as well as people, rooms, walls, and other structures that the robot is likely seeing in its environment.
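The kind of representation described above, a map whose entries carry semantic labels and are grouped into rooms and larger structures, can be pictured as a small hierarchical graph. The sketch below is only a toy data structure under that reading, not the 3D Dynamic Scene Graphs implementation; the class `SceneNode`, its fields, and the example scene are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class SceneNode:
    """One node of a toy hierarchical scene graph (hypothetical layout)."""
    node_id: str
    layer: str       # e.g. "building", "room", "object", "person"
    label: str       # semantic label such as "chair" or "kitchen"
    position: tuple  # (x, y, z) in metres, map frame
    children: list = field(default_factory=list)


def find_by_label(root, label):
    """Depth-first search for all nodes carrying a given semantic label."""
    found = [root] if root.label == label else []
    for child in root.children:
        found.extend(find_by_label(child, label))
    return found


# Toy scene: a house with a kitchen containing a table and a cup.
cup = SceneNode("o2", "object", "cup", (2.1, 0.4, 0.9))
table = SceneNode("o1", "object", "table", (2.0, 0.5, 0.0))
kitchen = SceneNode("r1", "room", "kitchen", (2.0, 0.0, 0.0), [table, cup])
house = SceneNode("b1", "building", "house", (0.0, 0.0, 0.0), [kitchen])

matches = find_by_label(house, "cup")
print([(m.node_id, m.position) for m in matches])
```

With a structure like this, a command such as "go to the kitchen and fetch a cup" reduces to looking up the room node and then the object node beneath it.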
Alexa, go to the kitchen and fetch me a snack
Wouldn't we all appreciate a little help around the house, especially if that help came in the form of a smart, adaptable, uncomplaining robot? Sure, there are the one-trick Roombas of the appliance world. But MIT engineers are envisioning robots more like home helpers, able to follow high-level, Alexa-type commands, such as "Go to the kitchen and fetch me a coffee cup." To carry out such high-level tasks, researchers believe robots will have to be able to perceive their physical environment as humans do. "In order to make any decision in the world, you need to have a mental model of the environment around you," says Luca Carlone, assistant professor of aeronautics and astronautics at MIT. "This is something so effortless for humans. But for robots it's a painfully hard problem, where it's about transforming pixel values that they see through a camera, into an understanding of the world."