 Spatial Reasoning


From Street Form to Spatial Justice: Explaining Urban Exercise Inequality via a Triadic SHAP-Informed Framework

arXiv.org Artificial Intelligence

Urban streets are essential public spaces that facilitate everyday physical activity and promote health equity. Drawing on Henri Lefebvre's spatial triad, this study proposes a conceptual and methodological framework to quantify street-level exercise deprivation through the dimensions of conceived (planning and structure), perceived (visual and sensory), and lived (practice and experiential) urban spaces. We integrate multi-source spatial data (street networks, street-view imagery, and social media) using explainable machine learning (SHAP analysis) to classify streets by their dominant deprivation modes, forming a novel typology of spatial inequity. Results highlight significant differences across urban contexts: older city cores predominantly experience infrastructural constraints (conceived space), whereas new development areas suffer from experiential disengagement (lived space). Furthermore, by identifying spatial mismatches between population distribution and exercise intensity, our study reveals localized clusters of latent deprivation. Simulation experiments demonstrate that targeted improvements across spatial dimensions can yield up to 14% increases in exercise supportiveness. This research not only operationalizes Lefebvre's spatial theory at the street scale but also provides actionable insights and intervention guidelines, contributing to the broader goals of spatial justice and urban health equity.
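One way to picture "classifying streets by their dominant deprivation mode" is to group per-feature SHAP values by spatial dimension and pick the dimension that pulls the prediction down the most. The sketch below is illustrative only: the abstract does not state the actual classification rule, and the feature names, their grouping, and the SHAP values are all hypothetical.

```python
from typing import Dict, List

# Hypothetical grouping of features into Lefebvre's three dimensions.
FEATURE_GROUPS: Dict[str, List[str]] = {
    "conceived": ["intersection_density", "sidewalk_width"],
    "perceived": ["green_view_index", "sky_openness"],
    "lived": ["checkin_count", "review_sentiment"],
}


def dominant_deprivation_mode(shap_values: Dict[str, float]) -> str:
    """Return the dimension whose features lower the prediction the most."""
    negative_mass: Dict[str, float] = {}
    for dimension, features in FEATURE_GROUPS.items():
        # Keep only negative contributions: they reduce predicted
        # exercise supportiveness for this street.
        negative_mass[dimension] = sum(
            min(shap_values.get(name, 0.0), 0.0) for name in features
        )
    # The most negative total marks the dominant deprivation mode.
    return min(negative_mass, key=negative_mass.get)


# Made-up SHAP values for a single street segment.
street = {
    "intersection_density": -0.30,
    "sidewalk_width": -0.10,
    "green_view_index": 0.05,
    "sky_openness": -0.02,
    "checkin_count": -0.08,
    "review_sentiment": 0.01,
}
print(dominant_deprivation_mode(street))
```

Summing only the negative contributions is a design choice made for this sketch; summing absolute values would instead measure overall influence rather than deprivation.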


Adaptive Gate-Aware Mamba Networks for Magnetic Resonance Fingerprinting

arXiv.org Artificial Intelligence

Magnetic Resonance Fingerprinting (MRF) enables fast quantitative imaging by matching signal evolutions to a predefined dictionary. However, conventional dictionary matching (DM) suffers from exponential growth in computational cost and memory usage as the number of parameters increases, limiting its scalability to multi-parametric mapping. To address this, recent work has explored deep learning-based approaches as alternatives to DM. We propose GAST-Mamba, an end-to-end framework that combines a dual Mamba-based encoder with a Gate-Aware Spatial-Temporal (GAST) processor. Built on structured state-space models, our architecture efficiently captures long-range spatial dependencies with linear complexity. On 5× accelerated simulated MRF data (200 frames), GAST-Mamba achieved a T1 PSNR of 33.12 dB, outperforming SCQ (31.69 dB). For T2 mapping, it reached a PSNR of 30.62 dB and SSIM of 0.9124. In vivo experiments further demonstrated improved anatomical detail and reduced artifacts. Ablation studies confirmed that each component contributes to performance, with the GAST module being particularly important under strong undersampling. These results demonstrate the effectiveness of GAST-Mamba for accurate and robust reconstruction from highly undersampled MRF acquisitions, offering a scalable alternative to traditional DM-based methods.
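For readers unfamiliar with the baseline, conventional dictionary matching can be sketched as a nearest-neighbor search by normalized inner product. The example below is a toy under stated assumptions: `toy_signal` is a made-up two-exponential curve standing in for a real Bloch simulation, and the parameter grids are arbitrary. It also shows why the dictionary grows with the product of the grid sizes.

```python
import math
from itertools import product
from typing import List, Sequence, Tuple

Atom = Tuple[Tuple[float, float], List[float]]


def toy_signal(t1_ms: float, t2_ms: float, times_ms: Sequence[float]) -> List[float]:
    """Made-up signal evolution; NOT a physically accurate MRF simulation."""
    return [(1.0 - math.exp(-t / t1_ms)) * math.exp(-t / t2_ms) for t in times_ms]


def unit(vector: Sequence[float]) -> List[float]:
    """Scale a vector to unit Euclidean norm."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]


def build_dictionary(t1_grid, t2_grid, times_ms) -> List[Atom]:
    # One normalized atom per (T1, T2) pair: size = len(t1_grid) * len(t2_grid).
    return [
        ((t1, t2), unit(toy_signal(t1, t2, times_ms)))
        for t1, t2 in product(t1_grid, t2_grid)
    ]


def match(signal: Sequence[float], dictionary: List[Atom]) -> Tuple[float, float]:
    """Return the (T1, T2) of the atom most correlated with the signal."""
    s = unit(signal)
    best = max(dictionary, key=lambda atom: abs(sum(a * b for a, b in zip(s, atom[1]))))
    return best[0]


TIMES_MS = [30.0 * k for k in range(1, 21)]
DICTIONARY = build_dictionary([500.0, 1000.0, 1500.0], [50.0, 100.0], TIMES_MS)

# A "measured" signal: a scaled copy of one atom (scale is removed by normalization).
measured = [3.0 * x for x in toy_signal(1000.0, 100.0, TIMES_MS)]
print(len(DICTIONARY), match(measured, DICTIONARY))
```

Adding a third parameter with ten grid values would multiply the dictionary size by ten, which is the scaling problem that learned reconstruction methods aim to avoid.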


Spatial and Semantic Embedding Integration for Stereo Sound Event Localization and Detection in Regular Videos

arXiv.org Artificial Intelligence

This report presents our systems submitted to the audio-only and audio-visual tracks of the DCASE2025 Task 3 Challenge: Stereo Sound Event Localization and Detection (SELD) in Regular Video Content. SELD is a complex task that combines temporal event classification with spatial localization, requiring reasoning across spatial, temporal, and semantic dimensions. The last is arguably the most challenging to model. Traditional SELD architectures rely on multichannel input, which limits their ability to leverage large-scale pre-training due to data constraints. To address this, we enhance standard SELD architectures with semantic information by integrating pre-trained, contrastive language-aligned models: CLAP for audio and OWL-ViT for visual inputs. These embeddings are incorporated into a modified Conformer module tailored for multimodal fusion, which we refer to as the Cross-Modal Conformer. Additionally, we incorporate autocorrelation-based acoustic features to improve distance estimation. We pre-train our models on curated synthetic audio and audio-visual datasets and apply a left-right channel swapping augmentation to further increase the training data. Both our audio-only and audio-visual systems substantially outperform the challenge baselines on the development set, demonstrating the effectiveness of our strategy. Performance is further improved through model ensembling and a visual post-processing step based on human keypoints. Future work will investigate the contribution of each modality and explore architectural variants to further enhance results.
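The left-right channel swapping augmentation mentioned above can be sketched as follows. This is a minimal illustration, not the authors' pipeline: it assumes stereo audio stored as (left, right) sample pairs and a label convention in which azimuth is in degrees, 0 is straight ahead, and swapping channels mirrors the sign. The `Event` type and its field names are hypothetical.

```python
from dataclasses import dataclass, replace
from typing import List, Tuple

Sample = Tuple[float, float]  # (left, right)


@dataclass(frozen=True)
class Event:
    """Hypothetical label record for one sound event."""
    label: str
    azimuth_deg: float  # assumed: 0 = front, sign flips under a left-right mirror
    distance_m: float


def swap_channels(audio: List[Sample], events: List[Event]):
    """Swap left and right channels and mirror each event's azimuth."""
    swapped_audio = [(right, left) for left, right in audio]
    # Distance and class are unaffected by mirroring; only azimuth changes.
    swapped_events = [replace(e, azimuth_deg=-e.azimuth_deg) for e in events]
    return swapped_audio, swapped_events


audio = [(0.1, 0.9), (0.2, 0.8)]
events = [Event("speech", 30.0, 2.5)]
aug_audio, aug_events = swap_channels(audio, events)
print(aug_audio, aug_events)
```

Because the labels are transformed together with the audio, each original clip yields one additional consistent training example, and applying the swap twice returns the original pair.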


SPATIA: Multimodal Model for Prediction and Generation of Spatial Cell Phenotypes

arXiv.org Artificial Intelligence

Understanding how cellular morphology, gene expression, and spatial organization jointly shape tissue function is a central challenge in biology. Image-based spatial transcriptomics technologies now provide high-resolution measurements of cell images and gene expression profiles, but machine learning methods typically analyze these modalities in isolation or at limited resolution. We address the problem of learning unified, spatially aware representations that integrate cell morphology, gene expression, and spatial context across biological scales. This requires models that can operate at single-cell resolution, reason across spatial neighborhoods, and generalize to whole-slide tissue organization. Here, we introduce SPATIA, a multi-scale generative and predictive model for spatial transcriptomics. SPATIA learns cell-level embeddings by fusing image-derived morphological tokens and transcriptomic vector tokens using cross-attention and then aggregates them at niche and tissue levels using transformer modules to capture spatial dependencies. SPATIA incorporates token merging in its generative diffusion decoder to synthesize high-resolution cell images conditioned on gene expression. We assembled a multi-scale dataset consisting of 17 million cell-gene pairs, 1 million niche-gene pairs, and 10,000 tissue-gene pairs across 49 donors, 17 tissue types, and 12 disease states. We benchmark SPATIA against 13 existing models across 12 individual tasks, which span several categories including cell annotation, cell clustering, gene imputation, cross-modal prediction, and image generation. SPATIA achieves improved performance over all baselines and generates realistic cell morphologies that reflect transcriptomic perturbations.


Grounded Gesture Generation: Language, Motion, and Space

arXiv.org Artificial Intelligence

Human motion generation has advanced rapidly in recent years, yet the critical problem of creating spatially grounded, context-aware gestures has been largely overlooked. Existing models typically specialize either in descriptive motion generation, such as locomotion and object interaction, or in isolated co-speech gesture synthesis aligned with utterance semantics. However, both lines of work often treat motion and environmental grounding separately, limiting advances toward embodied, communicative agents. To address this gap, our work introduces a multi-modal dataset and framework for grounded gesture generation, combining two key resources: (1) a synthetic dataset of spatially grounded referential gestures, and (2) MM-Conv, a VR-based dataset capturing two-party dialogues. Together, they provide over 7.7 hours of synchronized motion, speech, and 3D scene information, standardized in the HumanML3D format. Our framework further connects to a physics-based simulator, enabling synthetic data generation and situated evaluation. By bridging gesture modeling and spatial grounding, our contribution establishes a foundation for advancing research in situated gesture generation and grounded multimodal interaction.


Bio-Inspired Hybrid Map: Spatial Implicit Local Frames and Topological Map for Mobile Cobot Navigation

arXiv.org Artificial Intelligence

Navigation is a fundamental capacity for mobile robots, enabling them to operate autonomously in complex and dynamic environments. Conventional approaches use probabilistic models to localize robots and build maps simultaneously using sensor observations. Recent approaches employ human-inspired learning, such as imitation and reinforcement learning, to navigate robots more effectively. However, these methods suffer from high computational costs, global map inconsistency, and poor generalization to unseen environments. This paper presents a novel method inspired by how humans perceive and navigate novel environments. Specifically, we first build local frames that mimic how humans represent essential spatial information in the short term. Points in local frames are hybrid representations that combine spatial information with learned features; we call these spatial-implicit local frames. Then, we integrate spatial-implicit local frames into the global topological map represented as a factor graph. Lastly, we develop a novel navigation algorithm based on Rapidly-exploring Random Tree Star (RRT*) that leverages spatial-implicit local frames and the topological map to navigate effectively. To validate our approach, we conduct extensive experiments on real-world datasets and in in-lab environments. Our source code is available at https://github.com/tuantdang/simn.


Transformer with Koopman-Enhanced Graph Convolutional Network for Spatiotemporal Dynamics Forecasting

arXiv.org Machine Learning

Spatiotemporal dynamics forecasting is inherently challenging, particularly in systems defined over irregular geometric domains, due to the need to jointly capture complex spatial correlations and nonlinear temporal dynamics. To tackle these challenges, we propose TK-GCN, a two-stage framework that integrates geometry-aware spatial encoding with long-range temporal modeling. In the first stage, a Koopman-enhanced Graph Convolutional Network (K-GCN) is developed to embed the high-dimensional dynamics distributed on spatially irregular domains into a latent space where the evolution of system states is approximately linear. By leveraging Koopman operator theory, this stage enhances the temporal consistency during the latent learning. In the second stage, a Transformer module is employed to model the temporal progression within the Koopman-encoded latent space. Through the self-attention mechanism, the Transformer captures long-range temporal dependencies, enabling accurate forecasting over extended horizons. We evaluate TK-GCN in spatiotemporal cardiac dynamics forecasting and benchmark its performance against several state-of-the-art baselines. Experimental results and ablation studies show that TK-GCN consistently delivers superior predictive accuracy across a range of forecast horizons, demonstrating its capability to effectively model complex spatial structures and nonlinear temporal dynamics.
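The idea of a latent space "where the evolution of system states is approximately linear" can be illustrated with a tiny least-squares fit of a linear operator K such that z[t+1] ≈ K z[t]. This sketch is only an analogy: it assumes 2-D latent states that are already given, and it models neither the graph-convolutional encoder nor the Transformer stage described in the abstract. All function names are hypothetical.

```python
from typing import List

Vector = List[float]
Matrix = List[List[float]]


def apply(k: Matrix, z: Vector) -> Vector:
    """Multiply a 2x2 matrix by a 2-D vector."""
    return [k[0][0] * z[0] + k[0][1] * z[1], k[1][0] * z[0] + k[1][1] * z[1]]


def rollout(k: Matrix, z0: Vector, steps: int) -> List[Vector]:
    """Iterate z[t+1] = K z[t], returning the initial state plus `steps` states."""
    states = [list(z0)]
    for _ in range(steps):
        states.append(apply(k, states[-1]))
    return states


def fit_linear_operator(states: List[Vector]) -> Matrix:
    """Least-squares K = (sum z' z^T)(sum z z^T)^-1 for 2-D states."""
    a = [[0.0, 0.0], [0.0, 0.0]]  # sum of z[t+1] z[t]^T
    g = [[0.0, 0.0], [0.0, 0.0]]  # sum of z[t] z[t]^T
    for z, z_next in zip(states, states[1:]):
        for i in range(2):
            for j in range(2):
                a[i][j] += z_next[i] * z[j]
                g[i][j] += z[i] * z[j]
    det = g[0][0] * g[1][1] - g[0][1] * g[1][0]
    g_inv = [[g[1][1] / det, -g[0][1] / det], [-g[1][0] / det, g[0][0] / det]]
    return [[sum(a[i][m] * g_inv[m][j] for m in range(2)) for j in range(2)] for i in range(2)]


# Synthetic latent trajectory from a known operator (a decaying rotation).
TRUE_K = [[0.9, -0.2], [0.2, 0.9]]
trajectory = rollout(TRUE_K, [1.0, 0.0], 12)
estimated_k = fit_linear_operator(trajectory)
forecast = rollout(estimated_k, trajectory[-1], 3)[-1]
print(estimated_k, forecast)
```

Once such an operator is available, multi-step forecasting reduces to repeated matrix multiplication, which is what makes a linear latent representation attractive for long horizons.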


Do Tensorized Large-Scale Spatiotemporal Dynamic Atmospheric Data Exhibit Low-Rank Properties?

arXiv.org Artificial Intelligence

In this study, we investigate for the first time the low-rank properties of a tensorized large-scale spatio-temporal dynamic atmospheric variable. We focus on the Sentinel-5P tropospheric NO2 product (S5P-TN) over a four-year period in an area that encompasses the contiguous United States (CONUS). Here, it is demonstrated that a low-rank approximation of such a dynamic variable is feasible. We apply the low-rank properties of the S5P-TN data to inpaint gaps in the Sentinel-5P product by adopting a low-rank tensor model (LRTM) based on the CANDECOMP / PARAFAC (CP) decomposition and alternating least squares (ALS). Furthermore, we evaluate the LRTM's results by comparing them with spatial interpolation using geostatistics, and conduct a comprehensive spatial statistical and temporal analysis of the S5P-TN product. The results of this study demonstrated that the tensor completion successfully reconstructs the missing values in the S5P-TN product, particularly in the presence of extended cloud obscuration, predicting outliers and identifying hotspots, when the data is tensorized over extended spatial and temporal scales.


MARVIS: Modality Adaptive Reasoning over VISualizations

arXiv.org Artificial Intelligence

Scientific applications of machine learning often rely on small, specialized models tuned to particular domains. Such models often achieve excellent performance, but lack flexibility. Foundation models offer versatility, but typically underperform specialized approaches, especially on non-traditional modalities and long-tail domains. We propose MARVIS (Modality Adaptive Reasoning over VISualizations), a training-free method that enables even small vision-language models to predict any data modality with high accuracy. MARVIS transforms latent embedding spaces into visual representations and then leverages the spatial and fine-grained reasoning skills of VLMs to successfully interpret and utilize them. MARVIS achieves competitive performance on vision, audio, biological, and tabular domains using a single 3B parameter model, achieving results that beat Gemini by 16% on average and approach specialized methods, without exposing personally identifiable information (P.I.I.) or requiring any domain-specific training. We open source our code and datasets at https://github.com/penfever/marvis


A Survey on Vision-Language-Action Models: An Action Tokenization Perspective

arXiv.org Artificial Intelligence

The remarkable advancements of vision and language foundation models in multimodal understanding, reasoning, and generation have sparked growing efforts to extend such intelligence to the physical world, fueling the flourishing of vision-language-action (VLA) models. Despite seemingly diverse approaches, we observe that current VLA models can be unified under a single framework: vision and language inputs are processed by a series of VLA modules, producing a chain of action tokens that progressively encode more grounded and actionable information, ultimately generating executable actions. We further determine that the primary design choice distinguishing VLA models lies in how action tokens are formulated, which can be categorized into language description, code, affordance, trajectory, goal state, latent representation, raw action, and reasoning. However, there remains a lack of comprehensive understanding regarding action tokens, significantly impeding effective VLA development and obscuring future directions. Therefore, this survey aims to categorize and interpret existing VLA research through the lens of action tokenization, distill the strengths and limitations of each token type, and identify areas for improvement. Through this systematic review and analysis, we offer a synthesized outlook on the broader evolution of VLA models, highlight underexplored yet promising directions, and contribute guidance for future research, hoping to bring the field closer to general-purpose intelligence.