 Spatial Reasoning


Towards Scalable Spatial Intelligence via 2D-to-3D Data Lifting

arXiv.org Artificial Intelligence

Spatial intelligence is emerging as a transformative frontier in AI, yet it remains constrained by the scarcity of large-scale 3D datasets. Unlike the abundant 2D imagery, acquiring 3D data typically requires specialized sensors and laborious annotation. In this work, we present a scalable pipeline that converts single-view images into comprehensive, scale- and appearance-realistic 3D representations - including point clouds, camera poses, depth maps, and pseudo-RGBD - via integrated depth estimation, camera calibration, and scale calibration. Our method bridges the gap between the vast repository of imagery and the increasing demand for spatial scene understanding. By automatically generating authentic, scale-aware 3D data from images, we significantly reduce data collection costs and open new avenues for advancing spatial intelligence. We release two generated spatial datasets, i.e., COCO-3D and Objects365-v2-3D, and demonstrate through extensive experiments that our generated data can benefit various 3D tasks, ranging from fundamental perception to MLLM-based reasoning. These results validate our pipeline as an effective solution for developing AI systems capable of perceiving, understanding, and interacting with physical environments.
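As an illustration of the lifting step described above, the following minimal sketch back-projects a depth map into a camera-frame point cloud using a standard pinhole model. It is not the authors' pipeline: it assumes a metric depth map and the intrinsics (`fx`, `fy`, `cx`, `cy`) have already been estimated, and the function name `backproject_depth` is hypothetical.

```python
from typing import List, Tuple

Point = Tuple[float, float, float]


def backproject_depth(depth: List[List[float]], fx: float, fy: float,
                      cx: float, cy: float) -> List[Point]:
    """Lift a depth map to camera-frame 3D points with a pinhole model.

    depth[v][u] is the (assumed metric) depth at pixel column u, row v.
    fx, fy are focal lengths in pixels; cx, cy is the principal point.
    """
    points: List[Point] = []
    for v, row in enumerate(depth):
        for u, z in enumerate(row):
            if z <= 0:  # skip pixels with no valid depth
                continue
            x = (u - cx) * z / fx  # horizontal offset scaled by depth
            y = (v - cy) * z / fy  # vertical offset scaled by depth
            points.append((x, y, z))
    return points
```

Pairing each returned point with the pixel's colour would give a coloured point cloud; scale calibration, as the abstract notes, is what makes the depth values metric in the first place.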


Reality Proxy: Fluid Interactions with Real-World Objects in MR via Abstract Representations

arXiv.org Artificial Intelligence

Interacting with real-world objects in Mixed Reality (MR) often proves difficult when they are crowded, distant, or partially occluded, hindering straightforward selection and manipulation. We observe that these difficulties stem from performing interaction directly on physical objects, where input is tightly coupled to their physical constraints. Our key insight is to decouple interaction from these constraints by introducing proxies - abstract representations of real-world objects. We embody this concept in Reality Proxy, a system that seamlessly shifts interaction targets from physical objects to their proxies during selection. Beyond facilitating basic selection, Reality Proxy uses AI to enrich proxies with semantic attributes and hierarchical spatial relationships of their corresponding physical objects, enabling novel and previously cumbersome interactions in MR - such as skimming, attribute-based filtering, navigating nested groups, and complex multi-object selections - all without requiring new gestures or menu systems. We demonstrate Reality Proxy's versatility across diverse scenarios, including office information retrieval, large-scale spatial navigation, and multi-drone control. An expert evaluation supports the system's utility and usability, suggesting that proxy-based abstractions offer a powerful and generalizable interaction paradigm for future MR systems.


Sparser2Sparse: Single-shot Sparser-to-Sparse Learning for Spatial Transcriptomics Imputation with Natural Image Co-learning

arXiv.org Artificial Intelligence

Spatial transcriptomics (ST) has revolutionized biomedical research by enabling high-resolution gene expression profiling within tissues. However, the high cost and scarcity of high-resolution ST data remain significant challenges. We present Single-shot Sparser-to-Sparse (S2S-ST), a novel framework for accurate ST imputation that requires only a single, low-cost, sparsely sampled ST dataset alongside widely available natural images for co-training. Our approach integrates three key innovations: (1) a sparser-to-sparse self-supervised learning strategy that leverages intrinsic spatial patterns in ST data, (2) cross-domain co-learning with natural images to enhance feature representation, and (3) a Cascaded Data Consistent Imputation Network (CDCIN) that iteratively refines predictions while preserving sampled gene data fidelity. Extensive experiments on diverse tissue types, including breast cancer, liver, and lymphoid tissue, demonstrate that our method outperforms state-of-the-art approaches in imputation accuracy. By enabling robust ST reconstruction from sparse inputs, our framework significantly reduces reliance on costly high-resolution data, potentially facilitating broader adoption in biomedical research and clinical applications.

Keywords: Spatial Transcriptomics, Gene Expression Imputation, Single-shot Learning, Natural Image Co-training, Cost Reduction

1. Introduction

Spatial transcriptomics (ST) is a cutting-edge technology that enables the investigation of spatially resolved gene expression within tissues (Asp et al., 2020). Traditional transcriptomic approaches, such as single-cell RNA sequencing (scRNA-seq), provide high-throughput, high-resolution gene expression profiles but inherently lack spatial context (Aung et al., 2024; Boe et al., 2024; Sankar et al., 2024). However, spatial information is crucial for identifying disease biomarkers, understanding disease progression, and developing personalized treatment strategies.
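Two ideas from the S2S-ST abstract, the sparser-to-sparse training pair and data-consistent refinement, can be sketched in a few lines. This is only an illustration under stated assumptions, not the authors' implementation: spots are modelled as a dictionary from grid coordinates to a single expression value, and both function names (`make_sparser_pair`, `enforce_data_consistency`) are hypothetical.

```python
import random
from typing import Dict, Tuple

Coord = Tuple[int, int]
Spots = Dict[Coord, float]


def make_sparser_pair(sampled: Spots, keep_fraction: float,
                      seed: int = 0) -> Tuple[Spots, Spots]:
    """Split the measured sparse spots into a sparser input and held-out targets.

    A model trained to predict the held-out spots from the sparser input
    needs no dense ground truth (the self-supervised idea, as assumed here).
    """
    rng = random.Random(seed)
    coords = sorted(sampled)
    rng.shuffle(coords)
    n_keep = max(1, int(len(coords) * keep_fraction))
    kept = set(coords[:n_keep])
    sparser_input = {c: sampled[c] for c in kept}
    held_out = {c: v for c, v in sampled.items() if c not in kept}
    return sparser_input, held_out


def enforce_data_consistency(prediction: Spots, sampled: Spots) -> Spots:
    """Overwrite predicted values with measured ones wherever a measurement exists."""
    merged = dict(prediction)
    merged.update(sampled)  # measured spots always win
    return merged
```

In a cascaded network, a consistency step of this kind would presumably be applied after each refinement stage so that measured values are never altered.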


MSGM: A Multi-Scale Spatiotemporal Graph Mamba for EEG Emotion Recognition

arXiv.org Artificial Intelligence

EEG-based emotion recognition struggles with capturing multi-scale spatiotemporal dynamics and ensuring computational efficiency for real-time applications. To overcome these challenges, we propose the Multi-Scale Spatiotemporal Graph Mamba (MSGM), a novel framework integrating multi-window temporal segmentation, bimodal spatial graph modeling, and efficient fusion via the Mamba architecture. A multi-depth Graph Convolutional Network (GCN) and token embedding fusion module, paired with Mamba's state-space modeling, enable dynamic spatiotemporal interaction at linear complexity.

Emotion recognition has emerged as a critical research frontier with far-reaching implications for human-computer interaction, mental health monitoring, and neuroscientific exploration [1] [2] [3]. The ability to decode emotional states in real-time promises to revolutionize intelligent systems by enhancing user adaptability and bolstering clinical applications through early detection and management of emotional disorders [4] [5]. As these capabilities become increasingly vital in healthcare and artificial intelligence, there is an urgent need for robust, efficient, and neurophysiologically grounded approaches to overcome both theoretical complexities and practical deployment challenges [6]. Electroencephalography (EEG) stands out as a premier modality for emotion recognition, owing to its unparalleled capacity to non-invasively record brain activity with high temporal resolution, directly capturing the neural signatures of emotional processes [7].

Hanwen Liu and Yifeng Gong are with the School of Electronics and Communication Engineering, Sun Yat-sen University, Shenzhen, 518107, China, e-mail: (liuhw56, gongyf9)@mail2.sysu.edu.cn. Zuwei Yan is with the College of Communication Engineering, Jilin University, Changchun, 130012, China, e-mail: yanzw2422@mails.jlu.edu.cn.


Spatio-Temporal Demand Prediction for Food Delivery Using Attention-Driven Graph Neural Networks

arXiv.org Artificial Intelligence

Accurate demand forecasting is critical for enhancing the efficiency and responsiveness of food delivery platforms, where spatial heterogeneity and temporal fluctuations in order volumes directly influence operational decisions. This paper proposes an attention-based Graph Neural Network framework that captures spatial-temporal dependencies by modeling the food delivery environment as a graph. In this graph, nodes represent urban delivery zones, while edges reflect spatial proximity and inter-regional order flow patterns derived from historical data. The attention mechanism dynamically weighs the influence of neighboring zones, enabling the model to focus on the most contextually relevant areas during prediction. Temporal trends are jointly learned alongside spatial interactions, allowing the model to adapt to evolving demand patterns. Extensive experiments on real-world food delivery datasets demonstrate the superiority of the proposed model in forecasting future order volumes with high accuracy. The framework offers a scalable and adaptive solution to support proactive fleet positioning, resource allocation, and dispatch optimization in urban food delivery operations.
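The attention step described above, where neighbouring zones are weighted by relevance before aggregation, can be sketched as follows. This is a minimal illustration rather than the paper's model: it assumes dot-product scoring followed by a softmax, which the abstract does not specify, and the names `softmax` and `attend_over_neighbors` are hypothetical.

```python
import math
from typing import Dict, List, Tuple


def softmax(scores: List[float]) -> List[float]:
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend_over_neighbors(
    zone: str,
    features: Dict[str, List[float]],
    neighbors: Dict[str, List[str]],
) -> Tuple[List[float], Dict[str, float]]:
    """Aggregate neighbour features for one delivery zone.

    Scores are dot products between the zone's features and each
    neighbour's features (an assumed scoring function).
    """
    query = features[zone]
    nbrs = neighbors[zone]
    scores = [sum(q * k for q, k in zip(query, features[n])) for n in nbrs]
    weights = softmax(scores)
    aggregated = [
        sum(w * features[n][d] for w, n in zip(weights, nbrs))
        for d in range(len(query))
    ]
    return aggregated, dict(zip(nbrs, weights))
```

A trained model would learn the projections that produce the scores; the fixed dot product here only shows how the weights redistribute influence among neighbouring zones.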


Topological Social Choice: Designing a Noise-Robust Polar Distance for Persistence Diagrams

arXiv.org Artificial Intelligence

Topological Data Analysis (TDA) has emerged as a powerful framework for extracting robust and interpretable features from noisy high-dimensional data. In the context of Social Choice Theory, where preference profiles and collective decisions are geometrically rich yet sensitive to perturbations, TDA remains largely unexplored. This work introduces a novel conceptual bridge between these domains by proposing a new metric framework for persistence diagrams tailored to noisy preference data. We define a polar coordinate-based distance that captures both the magnitude and orientation of topological features in a smooth and differentiable manner. Our metric addresses key limitations of classical distances, such as bottleneck and Wasserstein, including instability under perturbation, lack of continuity, and incompatibility with gradient-based learning. The resulting formulation offers improved behavior in both theoretical and applied settings. To the best of our knowledge, this is the first study to systematically apply persistent homology to social choice systems, providing a mathematically grounded method for comparing topological summaries of voting structures and preference dynamics. We demonstrate the superiority of our approach through extensive experiments, including robustness tests and supervised learning tasks, and we propose a modular pipeline for building predictive models from online preference data. This work contributes a conceptually novel and computationally effective tool to the emerging interface of topology and decision theory, opening new directions in interpretable machine learning for political and economic systems.


Enhancing Spatial Reasoning in Vision-Language Models via Chain-of-Thought Prompting and Reinforcement Learning

arXiv.org Artificial Intelligence

This study investigates the spatial reasoning capabilities of vision-language models (VLMs) through Chain-of-Thought (CoT) prompting and reinforcement learning. We begin by evaluating the impact of different prompting strategies and find that simple CoT formats, where the model generates a reasoning step before the answer, not only fail to help, but can even harm the model's original performance. In contrast, structured multi-stage prompting based on scene graphs (SceneGraph CoT) significantly improves spatial reasoning accuracy. Furthermore, to improve spatial reasoning ability, we fine-tune models using Group Relative Policy Optimization (GRPO) on the SAT dataset and evaluate their performance on CVBench. Compared to supervised fine-tuning (SFT), GRPO achieves higher accuracy on Pass@1 evaluations and demonstrates superior robustness under out-of-distribution (OOD) conditions. In particular, we find that SFT overfits to surface-level linguistic patterns and may degrade performance when test-time phrasing changes (e.g., from "closer to" to "farther from"). GRPO, on the other hand, generalizes more reliably and maintains stable performance under such shifts. Our findings provide insights into how reinforcement learning and structured prompting improve the spatial reasoning capabilities and generalization behavior of modern VLMs. All code is open source at: https://github.com/Yvonne511/spatial-vlm-investigator
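For readers unfamiliar with GRPO, its central step is to score each sampled response relative to the other responses drawn for the same prompt. The sketch below shows only that normalisation, not the code in the linked repository. The function name `group_relative_advantages` is hypothetical, and using the population standard deviation is an assumption (implementations differ on this detail).

```python
from statistics import mean, pstdev
from typing import List


def group_relative_advantages(rewards: List[float],
                              eps: float = 1e-8) -> List[float]:
    """Normalise rewards within one group of responses to the same prompt.

    Each advantage is (reward - group mean) / (group std + eps), so
    responses are judged against their siblings rather than against a
    learned value function.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std: an assumed choice
    return [(r - mu) / (sigma + eps) for r in rewards]
```

With binary correctness rewards, a group in which every response is right (or every response is wrong) yields all-zero advantages and therefore contributes no learning signal.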


Transforming Football Data into Object-centric Event Logs with Spatial Context Information

arXiv.org Artificial Intelligence

Object-centric event logs expand the conventional single-case notion of an event log by considering multiple objects, allowing for the analysis of more complex and realistic process behavior. However, the number of real-world object-centric event logs remains limited, and further studies are needed to test their usefulness. The increasing availability of data from team sports can facilitate object-centric process mining, leveraging both real-world data and suitable use cases. In this paper, we present a framework for transforming football (soccer) data into an object-centric event log, further enhanced with a spatial dimension. We demonstrate the effectiveness of our framework by generating object-centric event logs based on real-world football data and discuss the results for varying process representations. With our paper, we provide the first example of object-centric event logs in football analytics. Future work should consider variant analysis and filtering techniques to better handle variability.
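To make the idea of an object-centric event with spatial context concrete, here is a small sketch of one possible in-memory representation. It is not the framework from the paper: the `Event` fields, the object types ("player", "ball"), the coordinate convention, and the helper `events_for_object` are all hypothetical choices made for illustration.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass
class Event:
    event_id: str
    activity: str                   # e.g. "pass", "shot"
    timestamp: float                # seconds since kickoff (assumed unit)
    objects: Dict[str, List[str]]   # object type -> ids of involved objects
    position: Tuple[float, float]   # pitch coordinates (assumed convention)


def events_for_object(log: List[Event], obj_type: str,
                      obj_id: str) -> List[Event]:
    """Return every event in which the given object takes part."""
    return [e for e in log if obj_id in e.objects.get(obj_type, [])]


# A tiny example log: one pass between two players, then a shot.
log = [
    Event("e1", "pass", 12.4, {"player": ["p7", "p10"], "ball": ["b1"]}, (34.0, 20.5)),
    Event("e2", "shot", 15.1, {"player": ["p10"], "ball": ["b1"]}, (88.0, 36.0)),
]
```

Unlike a single-case log, the same event `e1` belongs to the traces of both players and of the ball, which is what allows one log to support several process views.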


Perspective-Aware AI in Extended Reality

arXiv.org Artificial Intelligence

AI-enhanced Extended Reality (XR) aims to deliver adaptive, immersive experiences, yet current systems fall short due to shallow user modeling and limited cognitive context. We introduce Perspective-Aware AI in Extended Reality (PAiR), a foundational framework for integrating Perspective-Aware AI (PAi) with XR to enable interpretable, context-aware experiences grounded in user identity. PAi is built on Chronicles: reasoning-ready identity models learned from multimodal digital footprints that capture users' cognitive and experiential evolution. PAiR employs these models in a closed-loop system linking dynamic user states with immersive environments. We present PAiR's architecture, detailing its modules and system flow, and demonstrate its utility through two proof-of-concept scenarios implemented in the Unity-based Open-Dome engine. PAiR opens a new direction for human-AI interaction by embedding perspective-based identity models into immersive systems.


STRAP: Spatial-Temporal Risk-Attentive Vehicle Trajectory Prediction for Autonomous Driving

arXiv.org Artificial Intelligence

Accurate vehicle trajectory prediction is essential for ensuring safety and efficiency in fully autonomous driving systems. While existing methods primarily focus on modeling observed motion patterns and interactions with other vehicles, they often neglect the potential risks posed by the uncertain or aggressive behaviors of surrounding vehicles. In this paper, we propose a novel spatial-temporal risk-attentive trajectory prediction framework that incorporates a risk potential field to assess perceived risks arising from behaviors of nearby vehicles. The framework leverages a spatial-temporal encoder and a risk-attentive feature fusion decoder to embed the risk potential field into the extracted spatial-temporal feature representations for trajectory prediction. A risk-scaled loss function is further designed to improve the prediction accuracy of high-risk scenarios, such as short relative spacing. Experiments on the widely used NGSIM and HighD datasets demonstrate that our method reduces average prediction errors by 4.8% and 31.2% respectively compared to state-of-the-art approaches, especially in high-risk scenarios. The proposed framework provides interpretable, risk-aware predictions, contributing to more robust decision-making for autonomous driving systems.
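The abstract does not give the form of the risk-scaled loss, so the following is only a guess at the general shape of such a loss: a squared-error term whose per-sample weight grows with a risk score. The weighting `1 + alpha * risk`, the assumption that risk lies in [0, 1], and the name `risk_scaled_mse` are all hypothetical.

```python
from typing import List, Tuple

Point2D = Tuple[float, float]


def risk_scaled_mse(pred: List[Point2D], target: List[Point2D],
                    risk: List[float], alpha: float = 1.0) -> float:
    """Mean squared position error with per-sample risk weighting.

    risk[i] is an assumed score in [0, 1] (e.g. higher for short relative
    spacing); alpha controls how strongly high-risk samples are emphasised.
    """
    total = 0.0
    for (px, py), (tx, ty), r in zip(pred, target, risk):
        sq_err = (px - tx) ** 2 + (py - ty) ** 2
        total += (1.0 + alpha * r) * sq_err  # weight grows with risk
    return total / len(pred)
```

Setting `alpha` to zero recovers the plain mean squared error, which makes the effect of the risk term easy to isolate in an ablation.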