Spatial Reasoning
MedVQA-TREE: A Multimodal Reasoning and Retrieval Framework for Sarcopenia Prediction
Moradbeiki, Pardis, Ghadiri, Nasser, Zahabi, Sayed Jalal, Wiil, Uffe Kock, Brockhattingen, Kristoffer Kittelmann, Ebrahimi, Ali
Accurate sarcopenia diagnosis via ultrasound remains challenging due to subtle imaging cues, limited labeled data, and the absence of clinical context in most models. We propose MedVQA-TREE, a multimodal framework that integrates a hierarchical image interpretation module, a gated feature-level fusion mechanism, and a novel multi-hop, multi-query retrieval strategy. The vision module includes anatomical classification, region segmentation, and graph-based spatial reasoning to capture coarse, mid-level, and fine-grained structures. A gated fusion mechanism selectively integrates visual features with textual queries, while clinical knowledge is retrieved through a UMLS-guided pipeline accessing PubMed and a sarcopenia-specific external knowledge base. MedVQA-TREE was trained and evaluated on two public MedVQA datasets (VQA-RAD and PathVQA) and a custom sarcopenia ultrasound dataset. The model achieved up to 99% diagnostic accuracy and outperformed previous state-of-the-art methods by over 10%. These results underscore the benefit of combining structured visual understanding with guided knowledge retrieval for effective AI-assisted diagnosis in sarcopenia.
11Plus-Bench: Demystifying Multimodal LLM Spatial Reasoning with Cognitive-Inspired Analysis
Li, Chengzu, Wu, Wenshan, Zhang, Huanyu, Li, Qingtao, Gao, Zeyu, Xia, Yan, Hernรกndez-Orallo, Josรฉ, Vuliฤ, Ivan, Wei, Furu
For human cognitive process, spatial reasoning and perception are closely entangled, yet the nature of this interplay remains underexplored in the evaluation of multimodal large language models (MLLMs). While recent MLLM advancements show impressive performance on reasoning, their capacity for human-like spatial cognition remains an open question. In this work, we introduce a systematic evaluation framework to assess the spatial reasoning abilities of state-of-the-art MLLMs relative to human performance. Central to our work is 11Plus-Bench, a high-quality benchmark derived from realistic standardized spatial aptitude tests. 11Plus-Bench also features fine-grained expert annotations of both perceptual complexity and reasoning process, enabling detailed instance-level analysis of model behavior. Through extensive experiments across 14 MLLMs and human evaluation, we find that current MLLMs exhibit early signs of spatial cognition. Despite a large performance gap compared to humans, MLLMs' cognitive profiles resemble those of humans in that cognitive effort correlates strongly with reasoning-related complexity. However, instance-level performance in MLLMs remains largely random, whereas human correctness is highly predictable and shaped by abstract pattern complexity. These findings highlight both emerging capabilities and limitations in current MLLMs' spatial reasoning capabilities and provide actionable insights for advancing model design.
Geo2Vec: Shape- and Distance-Aware Neural Representation of Geospatial Entities
Spatial representation learning is essential for GeoAI applications such as urban analytics, enabling the encoding of shapes, locations, and spatial relationships (topological and distance-based) of geo-entities like points, polylines, and polygons. Existing methods either target a single geo-entity type or, like Poly2Vec, decompose entities into simpler components to enable Fourier transformation, introducing high computational cost. Moreover, since the transformed space lacks geometric alignment, these methods rely on uniform, non-adaptive sampling, which blurs fine-grained features like edges and boundaries. To address these limitations, we introduce Geo2Vec, a novel method inspired by signed distance fields (SDF) that operates directly in the original space. Geo2Vec adaptively samples points and encodes their signed distances (positive outside, negative inside), capturing geometry without decomposition. A neural network trained to approximate the SDF produces compact, geometry-aware, and unified representations for all geo-entity types. Additionally, we propose a rotation-invariant positional encoding to model high-frequency spatial variations and construct a structured and robust embedding space for downstream GeoAI models. Empirical results show that Geo2Vec consistently outperforms existing methods in representing shape and location, capturing topological and distance relationships, and achieving greater efficiency in real-world GeoAI applications. Code and Data can be found at: https://github.com/chuchen2017/GeoNeuralRepresentation.
DreamVLA: A Vision-Language-Action Model Dreamed with Comprehensive World Knowledge
Zhang, Wenyao, Liu, Hongsi, Qi, Zekun, Wang, Yunnan, Yu, Xinqiang, Zhang, Jiazhao, Dong, Runpei, He, Jiawei, Lu, Fan, Wang, He, Zhang, Zhizheng, Yi, Li, Zeng, Wenjun, Jin, Xin
Recent advances in vision-language-action (VLA) models have shown promise in integrating image generation with action prediction to improve generalization and reasoning in robot manipulation. However, existing methods are limited to challenging image-based forecasting, which suffers from redundant information and lacks comprehensive and critical world knowledge, including dynamic, spatial and semantic information. To address these limitations, we propose DreamVLA, a novel VLA framework that integrates comprehensive world knowledge forecasting to enable inverse dynamics modeling, thereby establishing a perception-prediction-action loop for manipulation tasks. Specifically, DreamVLA introduces a dynamic-region-guided world knowledge prediction, integrated with the spatial and semantic cues, which provide compact yet comprehensive representations for action planning. This design aligns with how humans interact with the world by first forming abstract multimodal reasoning chains before acting. To mitigate interference among the dynamic, spatial and semantic information during training, we adopt a block-wise structured attention mechanism that masks their mutual attention, preventing information leakage and keeping each representation clean and disentangled. Moreover, to model the conditional distribution over future actions, we employ a diffusion-based transformer that disentangles action representations from shared latent features. Extensive experiments on both real-world and simulation environments demonstrate that DreamVLA achieves 76.7% success rate on real robot tasks and 4.44 average length on the CALVIN ABC-D benchmarks.
Jigsaw-Puzzles: From Seeing to Understanding to Reasoning in Vision-Language Models
Lyu, Zesen, Zhang, Dandan, Ye, Wei, Li, Fangdi, Jiang, Zhihang, Yang, Yao
Spatial reasoning is a core component of human cognition, enabling individuals to perceive, comprehend, and interact with the physical world. It relies on a nuanced understanding of spatial structures and inter-object relationships, serving as the foundation for complex reasoning and decision-making. To investigate whether current vision-language models (VLMs) exhibit similar capability, we introduce Jigsaw-Puzzles, a novel benchmark consisting of 1,100 carefully curated real-world images with high spatial complexity. Based on this dataset, we design five tasks to rigorously evaluate VLMs' spatial perception, structural understanding, and reasoning capabilities, while deliberately minimizing reliance on domain-specific knowledge to better isolate and assess the general spatial reasoning capability. We conduct a comprehensive evaluation across 24 state-of-the-art VLMs. The results show that even the strongest model, Gemini-2.5-Pro, achieves only 77.14% overall accuracy and performs particularly poorly on the Order Generation task, with only 30.00% accuracy, far below the performance exceeding 90% achieved by human participants. This persistent gap underscores the need for continued progress, positioning Jigsaw-Puzzles as a challenging and diagnostic benchmark for advancing spatial reasoning research in VLMs. Our project page is at https://zesen01.github.io/jigsaw-puzzles.
From reactive to cognitive: brain-inspired spatial intelligence for embodied agents
Ruan, Shouwei, Wang, Liyuan, Kang, Caixin, Zhu, Qihui, Liu, Songming, Wei, Xingxing, Su, Hang
Spatial cognition enables adaptive goal-directed behavior by constructing internal models of space. Robust biological systems consolidate spatial knowledge into three interconnected forms: \textit{landmarks} for salient cues, \textit{route knowledge} for movement trajectories, and \textit{survey knowledge} for map-like representations. While recent advances in multi-modal large language models (MLLMs) have enabled visual-language reasoning in embodied agents, these efforts lack structured spatial memory and instead operate reactively, limiting their generalization and adaptability in complex real-world environments. Here we present Brain-inspired Spatial Cognition for Navigation (BSC-Nav), a unified framework for constructing and leveraging structured spatial memory in embodied agents. BSC-Nav builds allocentric cognitive maps from egocentric trajectories and contextual cues, and dynamically retrieves spatial knowledge aligned with semantic goals. Integrated with powerful MLLMs, BSC-Nav achieves state-of-the-art efficacy and efficiency across diverse navigation tasks, demonstrates strong zero-shot generalization, and supports versatile embodied behaviors in the real physical world, offering a scalable and biologically grounded path toward general-purpose spatial intelligence.
STGAtt: A Spatial-Temporal Unified Graph Attention Network for Traffic Flow Forecasting
Liang, Zhuding, Cui, Jianxun, Zeng, Qingshuang, Liu, Feng, Filipovic, Nenad, Geroski, Tijana
Accurate and timely traffic flow forecasting is crucial for intelligent transportation systems. This paper presents a novel deep learning model, the Spatial-Temporal Unified Graph Attention Network (STGAtt). By leveraging a unified graph representation and an attention mechanism, STGAtt effectively captures complex spatial-temporal dependencies. Unlike methods relying on separate spatial and temporal dependency modeling modules, STGAtt directly models correlations within a Spatial-Temporal Unified Graph, dynamically weighing connections across both dimensions. To further enhance its capabilities, STGAtt partitions traffic flow observation signal into neighborhood subsets and employs a novel exchanging mechanism, enabling effective capture of both short-range and long-range correlations. Extensive experiments on the PEMS-BAY and SHMetro datasets demonstrate STGAtt's superior performance compared to state-of-the-art baselines across various prediction horizons. Visualization of attention weights confirms STGAtt's ability to adapt to dynamic traffic patterns and capture long-range dependencies, highlighting its potential for real-world traffic flow forecasting applications.
MuST2-Learn: Multi-view Spatial-Temporal-Type Learning for Heterogeneous Municipal Service Time Estimation
Asif, Nadia, Hong, Zhiqing, Ren, Shaogang, Zhang, Xiaonan, Shang, Xiaojun, Yuan, Yukun
Non-emergency municipal services such as city 311 systems have been widely implemented across cities in Canada and the United States to enhance residents' quality of life. These systems enable residents to report issues, e.g., noise complaints, missed garbage collection, and potholes, via phone calls, mobile applications, or webpages. However, residents are often given limited information about when their service requests will be addressed, which can reduce transparency, lower resident satisfaction, and increase the number of follow-up inquiries. Predicting the service time for municipal service requests is challenging due to several complex factors: dynamic spatial-temporal correlations, underlying interactions among heterogeneous service request types, and high variation in service duration even within the same request category. In this work, we propose MuST2-Learn: a Multi-view Spatial-Temporal-Type Learning framework designed to address the aforementioned challenges by jointly modeling spatial, temporal, and service type dimensions. In detail, it incorporates an inter-type encoder to capture relationships among heterogeneous service request types and an intra-type variation encoder to model service time variation within homogeneous types. In addition, a spatiotemporal encoder is integrated to capture spatial and temporal correlations in each request type. The proposed framework is evaluated with extensive experiments using two real-world datasets. The results show that MuST2-Learn reduces mean absolute error by at least 32.5%, which outperforms state-of-the-art methods.
Machine Learning in Micromobility: A Systematic Review of Datasets, Techniques, and Applications
Yan, Sen, Kaundanya, Chinmaya, O'Connor, Noel E., Little, Suzanne, Liu, Mingming
Micromobility systems, which include lightweight and low-speed vehicles such as bicycles, e-bikes, and e-scooters, have become an important part of urban transportation and are used to solve problems such as traffic congestion, air pollution, and high transportation costs. Successful utilisation of micromobilities requires optimisation of complex systems for efficiency, environmental impact mitigation, and overcoming technical challenges for user safety. Machine Learning (ML) methods have been crucial to support these advancements and to address their unique challenges. However, there is insufficient literature addressing the specific issues of ML applications in micromobilities. This survey paper addresses this gap by providing a comprehensive review of datasets, ML techniques, and their specific applications in micromobilities. Specifically, we collect and analyse various micromobility-related datasets and discuss them in terms of spatial, temporal, and feature-based characteristics. In addition, we provide a detailed overview of ML models applied in micromobilities, introducing their advantages, challenges, and specific use cases. Furthermore, we explore multiple ML applications, such as demand prediction, energy management, and safety, focusing on improving efficiency, accuracy, and user experience. Finally, we propose future research directions to address these issues, aiming to help future researchers better understand this field.
TRUST-Planner: Topology-guided Robust Trajectory Planner for AAVs with Uncertain Obstacle Spatial-temporal Avoidance
Li, Junzhi, Long, Teng, Sun, Jingliang, Zhong, Jianxin
--Despite extensive developments in motion planning of autonomous aerial vehicles (AA Vs), existing frameworks faces the challenges of local minima and deadlock in complex dynamic environments, leading to increased collision risks. T o address these challenges, we present TRUST -Planner, a topology-guided hierarchical planning framework for robust spatial-temporal obstacle avoidance. In the frontend, a dynamic enhanced visible probabilistic roadmap (DEV-PRM) is proposed to rapidly explore topological paths for global guidance. The backend utilizes a uniform terminal-free minimum control polynomial (UTF-MINCO) and dynamic distance field (DDF) to enable efficient predictive obstacle avoidance and fast parallel computation. Furthermore, an incremental multi-branch trajectory management framework is introduced to enable spatio-temporal topological decision-making, while efficiently leveraging historical information to reduce replanning time. Simulation results show that TRUST - Planner outperforms baseline competitors, achieving a 96% success rate and millisecond-level computation efficiency in tested complex environments. Real-world experiments further validate the feasibility and practicality of the proposed method. URRENTL Y, autonomous aerial vehicles (AA Vs) play an increasingly crucial role in widespread areas, e.g., transportation, search and rescue, and sightseeing [1]. With the deepening of these applications, AA Vs are required to operate in increasingly complex and dynamic flight environments, where moving obstacles, such as pedestrians, vehicles, and other AA Vs, present growing risks of collision [2].