Spatial Reasoning
Typed Topological Structures Of Datasets
A datatset $X$ on $R^2$ is a finite topological space. Current research of a dataset focuses on statistical methods and the algebraic topological method \cite{carlsson}. In \cite{hu}, the concept of typed topological space was introduced and showed to have the potential for studying finite topological spaces, such as a dataset. It is a new method from the general topology perspective. A typed topological space is a topological space whose open sets are assigned types. Topological concepts and methods can be redefined using open sets of certain types. In this article, we develop a special set of types and its related typed topology on a dataset $X$. Using it, we can investigate the inner structure of $X$. In particular, $R^2$ has a natural quotient space, in which $X$ is organized into tracks, and each track is split into components. Those components are in a order. Further, they can be represented by an integer sequence. Components crossing tracks form branches, and the relationship can be well represented by a type of pseudotree (called typed-II pseudotree). Such structures provide a platform for new algorithms for problems such as calculating convex hull, holes, clustering and anomaly detection.
STPFormer: A State-of-the-Art Pattern-Aware Spatio-Temporal Transformer for Traffic Forecasting
Fang, Jiayu, Shao, Zhiqi, Choy, S T Boris, Gao, Junbin
Spatio-temporal traffic forecasting is challenging due to complex temporal patterns, dynamic spatial structures, and diverse input formats. Although Transformer-based models offer strong global modeling, they often struggle with rigid temporal encoding and weak space-time fusion. We propose STPFormer, a Spatio-Temporal Pattern-Aware Transformer that achieves state-of-the-art performance via unified and interpretable representation learning. It integrates four modules: Temporal Position Aggregator (TPA) for pattern-aware temporal encoding, Spatial Sequence Aggregator (SSA) for sequential spatial learning, Spatial-Temporal Graph Matching (STGM) for cross-domain alignment, and an Attention Mixer for multi-scale fusion. Experiments on five real-world datasets show that STPFormer consistently sets new SOTA results, with ablation and visualizations confirming its effectiveness and generalizability.
Spatial-Temporal Transformer with Curriculum Learning for EEG-Based Emotion Recognition
Lin, Xuetao, Peng, Tianhao, Dai, Peihong, Liang, Yu, Wu, Wenjun
-- EEG-based emotion recognition plays an important role in developing adaptive brain-computer communication systems, yet faces two fundamental challenges in practical implementations: (1) effective integration of non-stationary spatial-temporal neural patterns, (2) robust adaptation to dynamic emotional intensity variations in real-world scenarios. This paper proposes STT -CL, a novel framework integrating spatial-temporal transformers with curriculum learning. Our method introduces two core components: a spatial encoder that models inter-channel relationships and a temporal encoder that captures multi-scale dependencies through windowed attention mechanisms, enabling simultaneous extraction of spatial correlations and temporal dynamics from EEG signals. Complementing this architecture, an intensity-aware curriculum learning strategy progressively guides training from high-intensity to low-intensity emotional states through dynamic sample scheduling based on a dual difficulty assessment. Comprehensive experiments on three benchmark datasets demonstrate state-of-the-art performance across various emotional intensity levels, with ablation studies confirming the necessity of both architectural components and the curriculum learning mechanism. Emotion recognition constitutes a fundamental component of brain-inspired human-computer interaction systems [1].
A Multi-Resolution Benchmark Framework for Spatial Reasoning Assessment in Neural Networks
Imbriani, Manuela, Belmonte, Gina, Massink, Mieke, Tofani, Alessandro, Ciancia, Vincenzo
This paper presents preliminary results in the definition of a comprehensive benchmark framework designed to systematically evaluate spatial reasoning capabilities in neural networks, with a particular focus on morphological properties such as connectivity and distance relationships. The framework is currently being used to study the capabilities of nnU-Net, exploiting the spatial model checker V oxLogicA to generate two distinct categories of synthetic datasets: maze connectivity problems for topological analysis and spatial distance computation tasks for geometric understanding. Each category is evaluated across multiple resolutions to assess scalability and generalization properties. The automated pipeline encompasses a complete machine learning workflow including: synthetic dataset generation, standardized training with cross-validation, inference execution, and comprehensive evaluation using Dice coefficient and IoU (Intersection over Union) metrics. Preliminary experimental results demonstrate significant challenges in neural network spatial reasoning capabilities, revealing systematic failures in basic geometric and topological understanding tasks. The framework provides a reproducible experimental protocol, enabling researchers to identify specific limitations. Such limitations could be addressed through hybrid approaches combining neural networks with symbolic reasoning methods for improved spatial understanding in clinical applications, establishing a foundation for ongoing research into neural network spatial reasoning limitations and potential solutions.
OmniD: Generalizable Robot Manipulation Policy via Image-Based BEV Representation
Mao, Jilei, Guan, Jiarui, Tang, Yingjuan, Hu, Qirui, Li, Zhihang, Yu, Junjie, Mao, Yongjie, Sun, Yunzhe, Liu, Shuang, Ju, Xiaozhu
Ensuring robust generalization across diverse environments and scenarios remains a central challenge for real-world embodied systems. The generalization challenges primarily manifest in positional variations, background interference, viewpoint shifts, morphological differences, illumination changes, and environmental dynamics[1, 2]. To provide a clearer critique for the model's generalization capability, inspired by [3], we formally define in-distribution (ID), out-of-distribution (OOD) evaluations, and combinatorial-distribution (CD) for embodied scenarios. Taking object position generalization as an example, as shown in Figure 1: when the spatial distribution of pumpkins in test data aligns with the training distribution, it constitutes an ID scenario; significantly divergent distributions indicate OOD cases, while intermediate variations correspond to CD with varying discrepancy levels. Building upon this generalization capability formalization, we systematically evaluate existing methodologies' effectiveness. Methods like DP[4], ACT[5], etc [6] are capable of performing complex manipulation tasks and get a high ID success rate. They are prone to overfit to the specific ID scenario and fail to generalize to OOD. Even minor camera pose perturbations or subtle background variations can lead to significant performance degradation.
EVTP-IVS: Effective Visual Token Pruning For Unifying Instruction Visual Segmentation In Multi-Modal Large Language Models
Zhu, Wenhui, Chen, Xiwen, Wang, Zhipeng, Tang, Shao, Ghosh, Sayan, Dong, Xuanzhao, Koner, Rajat, Wang, Yalin
Instructed Visual Segmentation (IVS) tasks require segmenting objects in images or videos based on natural language instructions. While recent multimodal large language models (MLLMs) have achieved strong performance on IVS, their inference cost remains a major bottleneck, particularly in video. W e empirically analyze visual token sampling in MLLMs and observe a strong correlation between subset token coverage and segmentation performance. This motivates our design of a simple and effective token pruning method that selects a compact yet spatially representative subset of tokens to accelerate inference. In this paper, we introduce a novel visual token pruning method for IVS, called EVTP-IV, which builds upon the k -center by integrating spatial information to ensure better coverage. W e further provide an information-theoretic analysis to support our design. Experiments on standard IVS benchmarks show that our method achieves up to 5 speed-up on video tasks and 3.5 on image tasks, while maintaining comparable accuracy using only 20% of the tokens. Our method also consistently outperforms state-of-the-art pruning baselines under varying pruning ratios.
From Heuristics to Data: Quantifying Site Planning Layout Indicators with Deep Learning and Multi-Modal Data
Cao, Qian, Chen, Jielin, Zhao, Junchao, Stouffs, Rudi
The spatial layout of urban sites shapes land-use efficiency and spatial organization. Traditional site planning often relies on experiential judgment and single-source data, limiting systematic quantification of multifunctional layouts. We propose a Site Planning Layout Indicator (SPLI) system, a data-driven framework integrating empirical knowledge with heterogeneous multi-source data to produce structured urban spatial information. The SPLI supports multimodal spatial data systems for analytics, inference, and retrieval by combining OpenStreetMap (OSM), Points of Interest (POI), building morphology, land use, and satellite imagery. It extends conventional metrics through five dimensions: (1) Hierarchical Building Function Classification, refining empirical systems into clear hierarchies; (2) Spatial Organization, quantifying seven layout patterns (e.g., symmetrical, concentric, axial-oriented); (3) Functional Diversity, transforming qualitative assessments into measurable indicators using Functional Ratio (FR) and Simpson Index (SI); (4) Accessibility to Essential Services, integrating facility distribution and transport networks for comprehensive accessibility metrics; and (5) Land Use Intensity, using Floor Area Ratio (FAR) and Building Coverage Ratio (BCR) to assess utilization efficiency. Data gaps are addressed through deep learning, including Relational Graph Neural Networks (RGNN) and Graph Neural Networks (GNN). Experiments show the SPLI improves functional classification accuracy and provides a standardized basis for automated, data-driven urban spatial analytics.
Enhancing Interactive Voting-Based Map Matching: Improving Efficiency and Robustness for Heterogeneous GPS Trajectories
Alemanni, William, Burzacchi, Arianna, Colombi, Davide, Giarratano, Elena
This paper presents an enhanced version of the Interactive Voting-Based Map Matching algorithm, designed to efficiently process trajectories with varying sampling rates. The main aim is to reconstruct GPS trajectories with high accuracy, independent of input data quality. Building upon the original algorithm, developed exclusively for aligning GPS signals to road networks, we extend its capabilities by integrating trajectory imputation. Our improvements also include the implementation of a distance-bounded interactive voting strategy to reduce computational complexity, as well as modifications to address missing data in the road network. Furthermore, we incorporate a custom-built asset derived from OpenStreetMap, enabling this approach to be smoothly applied in any geographic region covered by OpenStreetMap's road network. These advancements preserve the core strengths of the original algorithm while significantly extending its applicability to diverse real-world scenarios.