Spatial Reasoning
VistaDPO: Video Hierarchical Spatial-Temporal Direct Preference Optimization for Large Video Models
Huang, Haojian, Chen, Haodong, Wu, Shengqiong, Luo, Meng, Fu, Jinlan, Du, Xinya, Zhang, Hanwang, Fei, Hao
Large Video Models (LVMs) built upon Large Language Models (LLMs) have shown promise in video understanding but often suffer from misalignment with human intuition and video hallucination issues. To address these challenges, we introduce VistaDPO, a novel framework for Video Hierarchical Spatial-Temporal Direct Preference Optimization. VistaDPO enhances text-video preference alignment across three hierarchical levels: i) Instance Level, aligning overall video content with responses; ii) Temporal Level, aligning video temporal semantics with event descriptions; and iii) Perceptive Level, aligning spatial objects with language tokens. Given the lack of datasets for fine-grained video-language preference alignment, we construct VistaDPO-7k, a dataset of 7.2K QA pairs annotated with chosen and rejected responses, along with spatial-temporal grounding information such as timestamps, keyframes, and bounding boxes. Extensive experiments on benchmarks such as Video Hallucination, Video QA, and Captioning performance tasks demonstrate that VistaDPO significantly improves the performance of existing LVMs, effectively mitigating video-language misalignment and hallucination. The code and data are available at https://github.com/HaroldChen19/VistaDPO.
RoboTwin: Dual-Arm Robot Benchmark with Generative Digital Twins
Mu, Yao, Chen, Tianxing, Chen, Zanxin, Peng, Shijia, Lan, Zhiqian, Gao, Zeyu, Liang, Zhixuan, Yu, Qiaojun, Zou, Yude, Xu, Mingkun, Lin, Lunkai, Xie, Zhiqiang, Ding, Mingyu, Luo, Ping
In the rapidly advancing field of robotics, dual-arm coordination and complex object manipulation are essential capabilities for developing advanced autonomous systems. However, the scarcity of diverse, high-quality demonstration data and real-world-aligned evaluation benchmarks severely limits such development. To address this, we introduce RoboTwin, a generative digital twin framework that uses 3D generative foundation models and large language models to produce diverse expert datasets and provide a real-world-aligned evaluation platform for dual-arm robotic tasks. Specifically, RoboTwin creates varied digital twins of objects from single 2D images, generating realistic and interactive scenarios. It also introduces a spatial relation-aware code generation framework that combines object annotations with large language models to break down tasks, determine spatial constraints, and generate precise robotic movement code. Our framework offers a comprehensive benchmark with both simulated and real-world data, enabling standardized evaluation and better alignment between simulated training and real-world performance. We validated our approach using the open-source COBOT Magic Robot platform. Policies pre-trained on RoboTwin-generated data and fine-tuned with limited real-world samples demonstrate significant potential for enhancing dual-arm robotic manipulation systems by improving success rates by over 70% for single-arm tasks and over 40% for dual-arm tasks compared to models trained solely on real-world data.
Embodied-R: Collaborative Framework for Activating Embodied Spatial Reasoning in Foundation Models via Reinforcement Learning
Zhao, Baining, Wang, Ziyou, Fang, Jianjie, Gao, Chen, Man, Fanhang, Cui, Jinqiang, Wang, Xin, Chen, Xinlei, Li, Yong, Zhu, Wenwu
Humans can perceive and reason about spatial relationships from sequential visual observations, such as egocentric video streams. However, how pretrained models acquire such abilities, especially high-level reasoning, remains unclear. This paper introduces Embodied-R, a collaborative framework combining large-scale Vision-Language Models (VLMs) for perception and small-scale Language Models (LMs) for reasoning. Using Reinforcement Learning (RL) with a novel reward system considering think-answer logical consistency, the model achieves slow-thinking capabilities with limited computational resources. After training on only 5k embodied video samples, Embodied-R with a 3B LM matches state-of-the-art multimodal reasoning models (OpenAI-o1, Gemini-2.5-pro) on both in-distribution and out-of-distribution embodied spatial reasoning tasks. Embodied-R also exhibits emergent thinking patterns such as systematic analysis and contextual integration. We further explore research questions including response length, training on VLM, strategies for reward design, and differences in model generalization after SFT (Supervised Fine-Tuning) and RL training.
Geographical Context Matters: Bridging Fine and Coarse Spatial Information to Enhance Continental Land Cover Mapping
Ghassemi, Babak, Fraga-Dantas, Cassio, Gaetano, Raffaele, Ienco, Dino, Ghorbanzadeh, Omid, Izquierdo-Verdiguier, Emma, Vuolo, Francesco
Land use and land cover mapping from Earth Observation (EO) data is a critical tool for sustainable land and resource management. While advanced machine learning and deep learning algorithms excel at analyzing EO imagery data, they often overlook crucial geospatial metadata information that could enhance scalability and accuracy across regional, continental, and global scales. To address this limitation, we propose BRIDGE-LC (Bi-level Representation Integration for Disentangled GEospatial Land Cover), a novel deep learning framework that integrates multi-scale geospatial information into the land cover classification process. By simultaneously leveraging fine-grained (latitude/longitude) and coarse-grained (biogeographical region) spatial information, our lightweight multi-layer perceptron architecture learns from both during training but only requires fine-grained information for inference, allowing it to disentangle region-specific from region-agnostic land cover features while maintaining computational efficiency. To assess the quality of our framework, we use an open-access in-situ dataset and adopt several competing classification approaches commonly considered for large-scale land cover mapping. We evaluated all approaches through two scenarios: an extrapolation scenario in which training data encompasses samples from all biogeographical regions, and a leave-one-region-out scenario where one region is excluded from training. We also explore the spatial representation learned by our model, highlighting a connection between its internal manifold and the geographical information used during training. Our results demonstrate that integrating geospatial information improves land cover mapping performance, with the most substantial gains achieved by jointly leveraging both fine- and coarse-grained spatial information.
Can Moran Eigenvectors Improve Machine Learning of Spatial Data? Insights from Synthetic Data Validation
Moran Eigenvector Spatial Filtering (ESF) approaches have shown promise in accounting for spatial effects in statistical models. Can this extend to machine learning? This paper examines the effectiveness of using Moran Eigenvectors as additional spatial features in machine learning models. We generate synthetic datasets with known processes involving spatially varying and nonlinear effects across two different geometries. Moran Eigenvectors calculated from different spatial weights matrices, with and without a priori eigenvector selection, are tested. We assess the performance of popular machine learning models, including Random Forests, LightGBM, XGBoost, and TabNet, and benchmark their accuracies in terms of cross-validated R2 values against models that use only coordinates as features. We also extract coefficients and functions from the models using GeoShapley and compare them with the true processes. Results show that machine learning models using only location coordinates achieve better accuracies than eigenvector-based approaches across various experiments and datasets. Furthermore, we discuss that while these findings are relevant for spatial processes that exhibit positive spatial autocorrelation, they do not necessarily apply when modeling network autocorrelation and cases with negative spatial autocorrelation, where Moran Eigenvectors would still be useful.
Improved Visual-Spatial Reasoning via R1-Zero-Like Training
Liao, Zhenyi, Xie, Qingsong, Zhang, Yanhao, Kong, Zijian, Lu, Haonan, Yang, Zhenyu, Deng, Zhijie
Increasing attention has been placed on improving the reasoning capacities of multi-modal large language models (MLLMs). As the cornerstone for AI agents that function in the physical realm, video-based visual-spatial intelligence (VSI) emerges as one of the most pivotal reasoning capabilities of MLLMs. This work conducts a first, in-depth study on improving the visual-spatial reasoning of MLLMs via R1-Zero-like training. Technically, we first identify that the visual-spatial reasoning capacities of small- to medium-sized Qwen2-VL models cannot be activated via Chain of Thought (CoT) prompts. We then incorporate GRPO training for improved visual-spatial reasoning, using the carefully curated VSI-100k dataset, following DeepSeek-R1-Zero. During the investigation, we identify the necessity to keep the KL penalty (even with a small value) in GRPO. With just 120 GPU hours, our vsGRPO-2B model, fine-tuned from Qwen2-VL-2B, can outperform the base model by 12.1% and surpass GPT-4o. Moreover, our vsGRPO-7B model, fine-tuned from Qwen2-VL-7B, achieves performance comparable to that of the best open-source model LLaVA-NeXT-Video-72B. Additionally, we compare vsGRPO to supervised fine-tuning and direct preference optimization baselines and observe strong performance superiority. The code and dataset will be available soon.
Spatially Directional Dual-Attention GAT for Spatial Fluoride Health Risk Modeling
Environmental exposure to fluoride is a major public health concern, particularly in regions with naturally elevated fluoride concentrations. Accurate modeling of fluoride-related health risks, such as dental fluorosis, requires spatially aware learning frameworks capable of capturing both geographic and semantic heterogeneity. In this work, we propose Spatially Directional Dual-Attention Graph Attention Network (SDD-GAT), a novel spatial graph neural network designed for fine-grained health risk prediction. SDD-GAT introduces a dual-graph architecture that disentangles geographic proximity and attribute similarity, and incorporates a directional attention mechanism that explicitly encodes spatial orientation and distance into the message passing process. To further enhance spatial coherence, we introduce a spatial smoothness regularization term that enforces consistency in predictions across neighboring locations. We evaluate SDD-GAT on a large-scale dataset covering over 50,000 fluoride monitoring samples and fluorosis records across Guizhou Province, China. Results show that SDD-GAT significantly outperforms traditional models and state-of-the-art GNNs in both regression and classification tasks, while also exhibiting improved spatial autocorrelation as measured by Moran's I. Our framework provides a generalizable foundation for spatial health risk modeling and geospatial learning under complex environmental settings.
Endowing Embodied Agents with Spatial Reasoning Capabilities for Vision-and-Language Navigation
Enhancing the spatial perception capabilities of mobile robots is crucial for achieving embodied Vision-and-Language Navigation (VLN). Although significant progress has been made in simulated environments, directly transferring these capabilities to real-world scenarios often results in severe hallucination phenomena, causing robots to lose effective spatial awareness. To address this issue, we propose BrainNav, a bio-inspired spatial cognitive navigation framework inspired by biological spatial cognition theories and cognitive map theory. BrainNav integrates dual-map (coordinate map and topological map) and dual-orientation (relative orientation and absolute orientation) strategies, enabling real-time navigation through dynamic scene capture and path planning. Its five core modules-Hippocampal Memory Hub, Visual Cortex Perception Engine, Parietal Spatial Constructor, Prefrontal Decision Center, and Cerebellar Motion Execution Unit-mimic biological cognitive functions to reduce spatial hallucinations and enhance adaptability. Validated in a zero-shot real-world lab environment using the Limo Pro robot, BrainNav, compatible with GPT-4, outperforms existing State-of-the-Art (SOTA) Vision-and-Language Navigation in Continuous Environments (VLN-CE) methods without fine-tuning.
GIScience in the Era of Artificial Intelligence: A Research Agenda Towards Autonomous GIS
Li, Zhenlong, Ning, Huan, Gao, Song, Janowicz, Krzysztof, Li, Wenwen, Arundel, Samantha T., Yang, Chaowei, Bhaduri, Budhendra, Wang, Shaowen, Zhu, A-Xing, Gahegan, Mark, Shekhar, Shashi, Ye, Xinyue, McKenzie, Grant, Cervone, Guido, Hodgson, Michael E.
The advent of generative AI exemplified by large language models (LLMs) opens new ways to represent and compute geographic information and transcends the process of geographic knowledge production, driving geographic information systems (GIS) towards autonomous GIS. Leveraging LLMs as the decision core, autonomous GIS can independently generate and execute geoprocessing workflows to perform spatial analysis. In this vision paper, we further elaborate on the concept of autonomous GIS and present a conceptual framework that defines its five autonomous goals, five autonomous levels, five core functions, and three operational scales. We demonstrate how autonomous GIS could perform geospatial data retrieval, spatial analysis, and map making with four proof-of-concept GIS agents. We conclude by identifying critical challenges and future research directions, including fine-tuning and self-growing decision-cores, autonomous modeling, and examining the societal and practical implications of autonomous GIS. By establishing the groundwork for a paradigm shift in GIScience, this paper envisions a future where GIS moves beyond traditional workflows to autonomously reason, derive, innovate, and advance geospatial solutions to pressing global challenges. Meanwhile, as we design and deploy increasingly intelligent geospatial systems, we carry a responsibility to ensure they are developed in socially responsible ways, serve the public good, and support the continued value of human geographic insight in an AI-augmented future.
STEI-PCN: an efficient pure convolutional network for traffic prediction via spatial-temporal encoding and inferring
Hu, Kai, Zhao, Zhidan, Hao, Zhifeng
STEI-PCN: an efficient pure convolutional network for traffic prediction via spatial-temporal encoding and inferring Kai Hu a, Zhidan Zhao b,c,, Zhifeng Hao a a Department of Mathematic, School of Mathematics and Computer Sciences, Shantou University, Shantou, 515063, Guangdong, China b Department of Computer Science, School of Mathematics and Computer Sciences, Shantou University, Shantou, 515063, Guangdong, China c Complexity Computation Laboratory, Department of Computer Science, School of Mathematics and Computer Sciences, Shantou University, Shantou, 515603, Guangdong, ChinaAbstract Traffic data exhibits complex temporal, spatial, and spatial-temporal correlations. Capturing and integrating these correlations is crucial for building accurate prediction models. Although numerous deep learning-based traffic prediction models have been developed, most of these models use either independent modules to separately extract temporal and spatial correlations or joint modules to synchronously extract them, without considering the spatial-temporal correlations. Moreover, models that consider joint spatial-temporal correlations (temporal, spatial, and spatial-temporal correlations) often encounter significant challenges in accuracy and computational efficiency which prevent such models from demonstrating the expected advantages of a joint spatial-temporal correlations architecture. To address these issues, this paper proposes an efficient pure convolutional network for traffic prediction via spatial-temporal encoding and inferring (STEI-PCN). The model introduces and designs a dynamic adjacency matrix inferring module based on absolute spatial and temporal coordinates, as well as relative spa-Corresponding author at: Department of Computer Science, School of Mathematics and Computer Sciences, Shantou University, Shantou, 515063, Guangdong, China and Complexity Computation Laboratory, Department of Computer Science, School of Mathematics and Computer Sciences, Shantou University, Shantou, 515603, Guangdong, China.