Spatial Reasoning
Bregman-Hausdorff divergence: strengthening the connections between computational geometry and machine learning
Pham, Tuyen, Kouřimská, Hana Dal Poz, Wagner, Hubert
The purpose of this paper is twofold. On the technical side, we propose an extension of the Hausdorff distance from metric spaces to spaces equipped with asymmetric distance measures. Specifically, we focus on the family of Bregman divergences, which includes the popular Kullback–Leibler divergence (also known as relative entropy). As a proof of concept, we use the resulting Bregman–Hausdorff divergence to compare two collections of probabilistic predictions produced by different machine learning models trained using the relative entropy loss. The algorithms we propose are surprisingly efficient even for large inputs with hundreds of dimensions. In addition to introducing this technical concept, we provide a survey. It outlines the basics of Bregman geometry, as well as computational geometry algorithms. We focus on algorithms that are compatible with this geometry and are relevant for machine learning.
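To make the idea of a Hausdorff-style comparison under an asymmetric measure concrete, the following is a minimal brute-force sketch. It is not the authors' algorithm: the abstract does not give the exact definition, so the argument order inside the divergence, the max-min form, and the names `kl_divergence` and `directed_kl_hausdorff` are assumptions made for illustration only.

```python
import numpy as np

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two probability vectors; eps guards against log(0)."""
    p = np.clip(p, eps, None)
    q = np.clip(q, eps, None)
    return float(np.sum(p * np.log(p / q)))

def directed_kl_hausdorff(P, Q):
    """Assumed form: max over p in P of the min over q in Q of KL(p || q).

    Brute force, O(|P| * |Q| * d); the paper's efficient algorithms are not
    reproduced here.
    """
    return max(min(kl_divergence(p, q) for q in Q) for p in P)

# Two small collections of 3-class probabilistic predictions (made-up values).
P = np.array([[0.7, 0.2, 0.1],
              [0.1, 0.8, 0.1]])
Q = np.array([[0.6, 0.3, 0.1],
              [0.2, 0.7, 0.1],
              [0.1, 0.1, 0.8]])

d_pq = directed_kl_hausdorff(P, Q)  # how far P is from being covered by Q
d_qp = directed_kl_hausdorff(Q, P)  # how far Q is from being covered by P
print(d_pq, d_qp)
```

In this toy example the two directions differ markedly, because Q contains a prediction with no close counterpart in P. Both the direction of the comparison and the argument order of the divergence matter when the underlying measure is asymmetric.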
MM-STFlowNet: A Transportation Hub-Oriented Multi-Mode Passenger Flow Prediction Method via Spatial-Temporal Dynamic Graph Modeling
Zhang, Ronghui, Xing, Wenbin, Li, Mengran, Wang, Zihan, Chen, Junzhou, Ma, Xiaolei, Liu, Zhiyuan, He, Zhengbing
Accurate and refined passenger flow prediction is essential for optimizing the collaborative management of multiple collection and distribution modes in large-scale transportation hubs. Traditional methods often focus only on the overall passenger volume, neglecting the interdependence between different modes within the hub. To address this limitation, we propose MM-STFlowNet, a comprehensive multi-mode prediction framework grounded in dynamic spatial-temporal graph modeling. Initially, an integrated temporal feature processing strategy is implemented using signal decomposition and convolution techniques to address data spikes and high volatility. Subsequently, we introduce the Spatial-Temporal Dynamic Graph Convolutional Recurrent Network (STDGCRN) to capture detailed spatial-temporal dependencies across multiple traffic modes, enhanced by an adaptive channel attention mechanism. Finally, the self-attention mechanism is applied to incorporate various external factors, further enhancing prediction accuracy. Experiments on a real-world dataset from Guangzhounan Railway Station in China demonstrate that MM-STFlowNet achieves state-of-the-art performance, particularly during peak periods, providing valuable insight for transportation hub management.
How to Enable LLM with 3D Capacity? A Survey of Spatial Reasoning in LLM
Zha, Jirong, Fan, Yuxuan, Yang, Xiao, Gao, Chen, Chen, Xinlei
3D spatial understanding is essential in real-world applications such as robotics, autonomous vehicles, virtual reality, and medical imaging. Recently, Large Language Models (LLMs), having demonstrated remarkable success across various domains, have been leveraged to enhance 3D understanding tasks, showing potential to surpass traditional computer vision methods. In this survey, we present a comprehensive review of methods integrating LLMs with 3D spatial understanding. We propose a taxonomy that categorizes existing methods into three branches: image-based methods deriving 3D understanding from 2D visual data, point cloud-based methods working directly with 3D representations, and hybrid modality-based methods combining multiple data streams. We systematically review representative methods along these categories, covering data representations, architectural modifications, and training strategies that bridge textual and 3D modalities. Finally, we discuss current limitations, including dataset scarcity and computational challenges, while highlighting promising research directions in spatial perception, multi-modal fusion, and real-world applications.
#AAAI2025 invited talk round-up 1: labour economics, and reasoning about spatial information
The 39th Annual AAAI Conference on Artificial Intelligence (AAAI 2025) took place in Philadelphia from Tuesday 25 February to Tuesday 4 March 2025. The programme featured eight invited talks. Susan works at the intersection of computer science and economics. In the past she has researched problems relating to mechanism design, auctions, pricing, and causal inference, but recently she has turned her attention to modelling worker career transitions using transformer models. In her talk, Susan described the research in a few of her recent papers covering topics such as the gender wage gap and economic prediction of labour sequence data.
Skeletonization Quality Evaluation: Geometric Metrics for Point Cloud Analysis in Robotics
Wen, Qingmeng, Lai, Yu-Kun, Ji, Ze, Tafrishi, Seyed Amir
Skeletonization is a powerful tool for shape analysis, rooted in the inherent instinct to understand an object's morphology. It has found applications across various domains, including robotics. Although skeletonization algorithms have been studied in recent years, their performance is rarely quantified with detailed numerical evaluations. This work focuses on defining and quantifying geometric properties to systematically score the skeletonization results of point cloud shapes across multiple aspects, including topological similarity, boundedness, centeredness, and smoothness. We introduce these representative metric definitions along with a numerical scoring framework to analyze skeletonization outcomes concerning point cloud data for different scenarios, from object manipulation to mobile robot navigation. Additionally, we provide an open-source tool to enable the research community to evaluate and refine their skeleton models. Finally, we assess the performance and sensitivity of the proposed geometric evaluation methods from various robotic applications.
STSA: Spatial-Temporal Semantic Alignment for Visual Dubbing
Ding, Zijun, Xiong, Mingdie, Zhu, Congcong, Chen, Jingrun
Existing audio-driven visual dubbing methods have achieved great success. Despite this, we observe that the semantic ambiguity between spatial and temporal domains significantly degrades the synthesis stability for dynamic faces. We argue that aligning the semantic features from spatial and temporal domains is a promising approach to stabilizing facial motion. To achieve this, we propose a Spatial-Temporal Semantic Alignment (STSA) method, which introduces a dual-path alignment mechanism and a differentiable semantic representation. The former leverages a Consistent Information Learning (CIL) module to maximize the mutual information at multiple scales, thereby reducing the manifold differences between spatial and temporal domains. The latter utilizes a probabilistic heatmap as ambiguity-tolerant guidance to avoid the abnormal dynamics of the synthesized faces caused by slight semantic jittering. Extensive experimental results demonstrate the superiority of the proposed STSA, especially in terms of image quality and synthesis stability.
DeepRV: pre-trained spatial priors for accelerated disease mapping
Navott, Jhonathan, Jenson, Daniel, Flaxman, Seth, Semenova, Elizaveta
Recently introduced prior-encoding deep generative models (e.g., PriorVAE, $\pi$VAE, and PriorCVAE) have emerged as powerful tools for scalable Bayesian inference by emulating complex stochastic processes like Gaussian processes (GPs). However, these methods remain largely proofs of concept and are inaccessible to practitioners. We propose DeepRV, a lightweight, decoder-only approach that accelerates training and enhances real-world applicability in comparison to current VAE-based prior-encoding approaches. Leveraging probabilistic programming frameworks (e.g., NumPyro) for inference, DeepRV achieves significant speedups while also improving the quality of parameter inference, closely matching full MCMC sampling. We showcase its effectiveness in process emulation and spatial analysis of the UK using simulated data, gender-wise cancer mortality rates for individuals under 50, and HIV prevalence in Zimbabwe. To bridge the gap between theory and practice, we provide a user-friendly API, enabling scalable and efficient Bayesian inference.
Robust Flower Cluster Matching Using The Unscented Transform
Chu, Andy, Shrestha, Rashik, Gu, Yu, Gross, Jason N.
Monitoring flowers over time is essential for precision robotic pollination in agriculture. To accomplish this, continuous spatial-temporal observation of plant growth can be carried out using stationary RGB-D cameras. However, image registration becomes a serious challenge due to changes in the visual appearance of the plant caused by the pollination process and occlusions from growth and camera angles. Plants flower in a manner that produces distinct clusters on branches. This paper presents a method for matching flower clusters using descriptors generated from RGB-D data while allowing for spatial uncertainty within the cluster. The proposed approach leverages the Unscented Transform to efficiently estimate plant descriptor uncertainty tolerances, enabling a robust image-registration process despite temporal changes. The Unscented Transform is used to handle the nonlinear transformations by propagating the uncertainty of flower positions to determine the variations in the descriptor domain. A Monte Carlo simulation is used to validate the Unscented Transform results, confirming our method's effectiveness for flower cluster matching. Therefore, it can facilitate improved robotic pollination in dynamic environments. Although global agriculture relies heavily on pollination, evidence has shown that the population of natural pollinators is decreasing, raising concerns about food and the economy [1].
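The propagation step described above can be illustrated with a generic Unscented Transform checked against Monte Carlo sampling. This is a sketch under stated assumptions, not the paper's pipeline: the descriptor function `to_range_bearing`, the example mean and covariance, and the choice `kappa=1.0` are all hypothetical stand-ins for the authors' flower-cluster descriptors.

```python
import numpy as np

def unscented_transform(mean, cov, f, kappa=1.0):
    """Propagate a Gaussian (mean, cov) through a nonlinear function f."""
    n = mean.shape[0]
    # Sigma points: the mean plus/minus the columns of a scaled Cholesky factor.
    L = np.linalg.cholesky((n + kappa) * cov)
    sigma = np.vstack([mean, mean + L.T, mean - L.T])  # shape (2n + 1, n)
    # Standard weights: kappa / (n + kappa) for the centre, equal for the rest.
    w = np.full(2 * n + 1, 1.0 / (2.0 * (n + kappa)))
    w[0] = kappa / (n + kappa)
    y = np.array([f(s) for s in sigma])
    y_mean = w @ y
    diff = y - y_mean
    y_cov = (w[:, None] * diff).T @ diff
    return y_mean, y_cov

def to_range_bearing(p):
    """Hypothetical descriptor: polar coordinates of a 2D flower position."""
    return np.array([np.hypot(p[0], p[1]), np.arctan2(p[1], p[0])])

mean = np.array([2.0, 1.0])    # assumed flower position (metres)
cov = np.diag([0.01, 0.01])    # assumed position uncertainty
ut_mean, ut_cov = unscented_transform(mean, cov, to_range_bearing)

# Monte Carlo reference: sample positions, push each through the same mapping.
rng = np.random.default_rng(0)
samples = rng.multivariate_normal(mean, cov, size=200_000)
mc = np.column_stack([np.hypot(samples[:, 0], samples[:, 1]),
                      np.arctan2(samples[:, 1], samples[:, 0])])
mc_mean, mc_cov = mc.mean(axis=0), np.cov(mc, rowvar=False)
print(ut_mean, mc_mean)
```

The transform evaluates the function at only 2n + 1 sigma points, whereas the Monte Carlo reference uses 200,000 samples, which is why it is attractive when descriptor uncertainty has to be estimated repeatedly.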
Geographical hotspot prediction based on point cloud-voxel-community partition clustering
Existing solutions to the hotspot prediction problem in the field of geographic information remain at a relatively preliminary stage. This study presents a novel approach for detecting and predicting geographical hotspots, utilizing point cloud-voxel-community partition clustering. By analyzing high-dimensional data, we represent spatial information through point clouds, which are then subdivided into multiple voxels to enhance analytical efficiency. Our method identifies spatial voxels with similar characteristics through community partitioning, thereby revealing underlying patterns in hotspot distributions. Experimental results indicate that when applied to a dataset of archaeological sites in Turkey, our approach achieves a 19.31% increase in processing speed, with an accuracy loss of merely 6%, outperforming traditional clustering methods. This method not only provides a fresh perspective for hotspot prediction but also serves as an effective tool for high-dimensional data analysis.
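The voxel subdivision step mentioned above can be sketched as a simple binning of points into a regular grid. This is only an illustration under assumptions: the function name `voxelize`, the voxel size, and the sample coordinates are hypothetical, and the community-partitioning stage of the method is not reproduced.

```python
import numpy as np

def voxelize(points, voxel_size):
    """Assign each point to a cubic voxel and count the points per voxel.

    Returns the integer grid indices of the occupied voxels (sorted
    lexicographically) and the number of points in each of them.
    """
    # Shift so the minimum corner is the origin, then bin by integer division.
    idx = np.floor((points - points.min(axis=0)) / voxel_size).astype(int)
    voxels, counts = np.unique(idx, axis=0, return_counts=True)
    return voxels, counts

# Made-up 3D points standing in for site coordinates.
points = np.array([[0.1, 0.2, 0.0],
                   [0.3, 0.1, 0.2],
                   [1.5, 0.2, 0.1],
                   [1.6, 0.4, 0.3],
                   [1.7, 0.1, 0.2],
                   [0.2, 1.8, 0.1]])

voxels, counts = voxelize(points, voxel_size=1.0)
print(voxels.tolist(), counts.tolist())
```

Working with per-voxel summaries such as these counts, rather than with individual points, is one plausible source of the reported efficiency gain: later stages operate on far fewer items, at the cost of some spatial resolution.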
A Spatial-temporal Deep Probabilistic Diffusion Model for Reliable Hail Nowcasting with Radar Echo Extrapolation
Shi, Haonan, Tian, Long, Tao, Jie, Li, Yufei, Wang, Liming, Liu, Xiyang
Hail is a considerable contributor to meteorological disasters, and there is a great need to mitigate its socioeconomic effects through precise forecasts that offer high resolution, long lead times, and local detail over large landscapes. Existing medium-range weather forecasting methods primarily rely on changes in upper air currents and cloud layers to predict precipitation events, such as heavy rainfall; they are unsuitable for hail nowcasting, since hail is mainly caused by low-altitude, local strong convection associated with terrain. Additionally, radar captures the status of low cloud layers, such as water vapor, droplets, and ice crystals, providing rich signals suitable for hail nowcasting. To this end, we introduce a Spatial-Temporal gEnerAtive Model called SteamCast for hail nowcasting with radar echo extrapolation. It is a deep probabilistic diffusion model based on spatial-temporal representations, including radar echoes as well as their position/time embeddings, which we trained on a historical reanalysis archive from the Yan'an Meteorological Bureau in China, where crops such as apples suffer greatly from hail damage. Considering the short-term nature of hail, SteamCast provides 30-minute nowcasts at 6-minute intervals for a single radar reflectivity variable, across 9 different vertical angles, on a latitude-longitude grid with approximately 1 km × 1 km resolution per pixel in Yan'an City, China. By successfully fusing the spatial-temporal features of radar echoes, SteamCast delivers competitive, and in some cases superior, results compared to other deep learning-based models such as PredRNN and VMRNN.