 Spatial Reasoning


Generic Multimodal Spatially Graph Network for Spatially Embedded Network Representation Learning

arXiv.org Artificial Intelligence

Spatially embedded networks (SENs) represent a special type of complex graph, whose topologies are constrained by the networks' embedded spatial environments. The graph representation of such networks is thereby influenced by the embedded spatial features of both nodes and edges. Accurate network representation of the graph structure and graph features is a fundamental task for various graph-related tasks. In this study, a Generic Multimodal Spatially Graph Convolutional Network (GMu-SGCN) is developed for efficient representation of spatially embedded networks. The developed GMu-SGCN model has the ability to learn the node connection pattern via multimodal node and edge features. In order to evaluate the developed model, a river network dataset and a power network dataset have been used as test beds. The river network represents the naturally developed SENs, whereas the power network represents a man-made network. Both types of networks are heavily constrained by the spatial environments and uncertainties from nature. Comprehensive evaluation analysis shows the developed GMu-SGCN can improve accuracy of the edge existence prediction task by 37.1% compared to a GraphSAGE model which only considers the node's position feature in a power network test bed. Our model demonstrates the importance of considering the multidimensional spatial feature for spatially embedded network representation.


A Hybrid Random Forest and CNN Framework for Tile-Wise Oil-Water Classification in Hyperspectral Images

arXiv.org Artificial Intelligence

A novel hybrid Random Forest and Convolutional Neural Network (CNN) framework is presented for oil-water classification in hyperspectral images (HSI). To address the challenge of preserving spatial context, the images were divided into smaller, non-overlapping tiles, which served as the basis for training, validation, and testing. Random Forest demonstrated strong performance in pixel-wise classification, outperforming models such as XGBoost, Attention-Based U-Net, and HybridSN. However, Random Forest loses spatial context, limiting its ability to fully exploit the spatial relationships in hyperspectral data. To improve performance, a CNN was trained on the probability maps generated by the Random Forest, leveraging the CNN's capacity to incorporate spatial context. The hybrid approach achieved 7.6% improvement in recall (to 0.85), 2.4% improvement in F1 score (to 0.84), and 0.54% improvement in AUC (to 0.99) compared to the baseline. These results highlight the effectiveness of combining probabilistic outputs with spatial feature learning for context-aware analysis of hyperspectral images.
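
The tiling step described in the abstract above can be sketched as follows. This is a minimal illustration, not the authors' code: the function name `split_into_tiles`, the use of nested lists for a single-band image, and the choice to drop edge pixels that do not fill a whole tile are all assumptions made for the example.

```python
def split_into_tiles(image, tile):
    """Split a 2D image (list of rows) into non-overlapping tile x tile blocks.

    Assumption: rows/columns that do not fill a complete tile are dropped.
    Tiles are returned in row-major order.
    """
    rows = len(image) // tile
    cols = len(image[0]) // tile
    tiles = []
    for r in range(rows):
        for c in range(cols):
            # Slice the rows of this tile, then the columns within each row.
            block = [row[c * tile:(c + 1) * tile]
                     for row in image[r * tile:(r + 1) * tile]]
            tiles.append(block)
    return tiles


# Hypothetical 4 x 6 single-band image; each value encodes its (row, column).
image = [[10 * y + x for x in range(6)] for y in range(4)]
tiles = split_into_tiles(image, tile=2)
```

Because the tiles do not overlap, each one can be assigned whole to the training, validation, or test set, so no pixel is shared between sets.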


RLS3: RL-Based Synthetic Sample Selection to Enhance Spatial Reasoning in Vision-Language Models for Indoor Autonomous Perception

arXiv.org Artificial Intelligence

Vision-language model (VLM) fine-tuning for application-specific visual grounding based on natural language instructions has become one of the most popular approaches for learning-enabled autonomous systems. However, such fine-tuning relies heavily on high-quality datasets to achieve successful performance in various downstream tasks. Additionally, VLMs often encounter limitations due to insufficient and imbalanced fine-tuning data. To address these issues, we propose a new generalizable framework to improve VLM fine-tuning by integrating it with a reinforcement learning (RL) agent. Our method utilizes the RL agent to manipulate objects within an indoor setting to create synthetic data for fine-tuning to address certain vulnerabilities of the VLM. Specifically, we use the performance of the VLM to provide feedback to the RL agent to generate informative data that efficiently fine-tune the VLM over the targeted task (e.g. spatial reasoning). The key contribution of this work is developing a framework where the RL agent serves as an informative data sampling tool and assists the VLM in order to enhance performance and address task-specific vulnerabilities. By targeting the data sampling process to address the weaknesses of the VLM, we can effectively train a more context-aware model. In addition, generating synthetic data allows us to have precise control over each scene and generate granular ground truth captions. Our results show that the proposed data generation approach improves the spatial reasoning performance of VLMs, which demonstrates the benefits of using RL-guided data generation in vision-language tasks.


An Optimal Cascade Feature-Level Spatiotemporal Fusion Strategy for Anomaly Detection in CAN Bus

arXiv.org Artificial Intelligence

Autonomous vehicles represent a revolutionary advancement driven by the integration of artificial intelligence within intelligent transportation systems. However, they remain vulnerable due to the absence of robust security mechanisms in the Controller Area Network (CAN) bus. In order to mitigate the security issue, many machine learning models and strategies have been proposed, which primarily focus on a subset of dominant patterns of anomalies and lack rigorous evaluation in terms of reliability and robustness. Therefore, to address the limitations of previous works and mitigate the security vulnerability in CAN bus, the current study develops a model based on the intrinsic nature of the problem to cover all dominant patterns of anomalies. To achieve this, a cascade feature-level fusion strategy optimized by a two-parameter genetic algorithm is proposed to combine temporal and spatial information. Subsequently, the model is evaluated using a paired t-test to ensure reliability and robustness. Finally, a comprehensive comparative analysis conducted on two widely used datasets advocates that the proposed model outperforms other models and achieves superior accuracy and F1-score, demonstrating the best performance among all models presented to date.
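
The paired t-test mentioned in the abstract above compares two models evaluated on the same folds or runs. A minimal sketch of the test statistic follows; the function name `paired_t_statistic` and the score lists are hypothetical and are not taken from the paper.

```python
import math


def paired_t_statistic(scores_a, scores_b):
    """Return (t, degrees_of_freedom) for a paired t-test on matched scores."""
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    n = len(diffs)
    mean = sum(diffs) / n
    # Sample variance of the per-pair differences (n - 1 in the denominator).
    var = sum((d - mean) ** 2 for d in diffs) / (n - 1)
    t = mean / math.sqrt(var / n)
    return t, n - 1


# Hypothetical F1-scores of two models on the same five folds.
model_a = [0.91, 0.93, 0.92, 0.94, 0.90]
model_b = [0.89, 0.90, 0.91, 0.90, 0.89]
t, df = paired_t_statistic(model_a, model_b)
```

The statistic is then compared against a t distribution with `df` degrees of freedom to obtain a p-value; a library routine such as `scipy.stats.ttest_rel` performs both steps.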


SpatialVLA: Exploring Spatial Representations for Visual-Language-Action Model

arXiv.org Artificial Intelligence

In this paper, we claim that spatial understanding is the keypoint in robot manipulation, and propose SpatialVLA to explore effective spatial representations for the robot foundation model. Specifically, we introduce Ego3D Position Encoding to inject 3D information into the input observations of the visual-language-action model, and propose Adaptive Action Grids to represent spatial robot movement actions with adaptive discretized action grids, facilitating learning generalizable and transferrable spatial action knowledge for cross-robot control. SpatialVLA is first pre-trained on top of a vision-language model with 1.1 Million real-world robot episodes, to learn a generalist manipulation policy across multiple robot environments and tasks. After pre-training, SpatialVLA is directly applied to perform numerous tasks in a zero-shot manner. The superior results in both simulation and real-world robots demonstrate its advantage of inferring complex robot motion trajectories and its strong in-domain multi-task generalization ability. We further show the proposed Adaptive Action Grids offer a new and effective way to fine-tune the pre-trained SpatialVLA model for new simulation and real-world setups, where the pre-learned action grids are re-discretized to capture robot-specific spatial action movements of new setups. The superior results from extensive evaluations demonstrate the exceptional in-distribution generalization and out-of-distribution adaptation capability, highlighting the crucial benefit of the proposed spatial-aware representations for generalist robot policy learning. All the details and codes will be open-sourced.


Foundation for unbiased cross-validation of spatio-temporal models for species distribution modeling

arXiv.org Artificial Intelligence

Species Distribution Models (SDMs) often suffer from spatial autocorrelation (SAC), leading to biased performance estimates. We tested cross-validation (CV) strategies - random splits, spatial blocking with varied distances, environmental (ENV) clustering, and a novel spatio-temporal method - under two proposed training schemes: LAST FOLD, widely used in spatial CV at the cost of data loss, and RETRAIN, which maximizes data usage but risks reintroducing SAC. LAST FOLD consistently yielded lower errors and stronger correlations. Spatial blocking at an optimal distance (SP 422) and ENV performed best, achieving Spearman and Pearson correlations of 0.485 and 0.548, respectively, although ENV may be unsuitable for long-term forecasts involving major environmental shifts. A spatio-temporal approach yielded modest benefits in our moderately variable dataset, but may excel with stronger temporal changes. These findings highlight the need to align CV approaches with the spatial and temporal structure of SDM data, ensuring rigorous validation and reliable predictive outcomes.
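
Spatial blocking, one of the CV strategies listed in the abstract above, keeps nearby observations in the same fold so that spatially autocorrelated points do not straddle the train/test split. The sketch below uses a simple square grid; the grid scheme, the function name `spatial_block_folds`, and the sample coordinates are assumptions for illustration and may differ from the blocking used in the paper.

```python
def spatial_block_folds(points, block_size, n_folds):
    """Assign each (x, y) point a fold index so that all points falling in
    the same square grid block share one fold."""
    block_ids = [(int(x // block_size), int(y // block_size))
                 for x, y in points]
    # Distribute blocks over folds round-robin, in sorted block order.
    unique_blocks = sorted(set(block_ids))
    block_to_fold = {blk: i % n_folds for i, blk in enumerate(unique_blocks)}
    return [block_to_fold[blk] for blk in block_ids]


# Hypothetical coordinates in arbitrary distance units.
points = [(1, 1), (2, 3), (12, 1), (13, 2), (1, 14), (25, 25)]
folds = spatial_block_folds(points, block_size=10, n_folds=2)
```

Larger values of `block_size` put more distance between training and test points, at the cost of fewer distinct blocks to distribute across folds.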


Exact Fit Attention in Node-Holistic Graph Convolutional Network for Improved EEG-Based Driver Fatigue Detection

arXiv.org Artificial Intelligence

EEG-based fatigue monitoring can effectively reduce the incidence of related traffic accidents. In the past decade, with the advancement of deep learning, convolutional neural networks (CNN) have been increasingly used for EEG signal processing. However, due to the data's non-Euclidean characteristics, existing CNNs may lose important spatial information from EEG, specifically channel correlation. Thus, we propose the node-holistic graph convolutional network (NHGNet), a model that uses graphic convolution to dynamically learn each channel's features. The interpretability is enhanced by revealing critical areas of brain activity and their interrelations in various mental states. In validations on two public datasets, NHGNet outperforms the SOTAs. Specifically, in the intra-subject, NHGNet improved detection accuracy by at least 2.34% and 3.42%, and in the inter-subjects, it improved by at least 2.09% and 15.06%. Visualization research on the model revealed that the central parietal area plays an important role in detecting fatigue levels, whereas the frontal and temporal lobes are essential for maintaining vigilance. Duo Chen is with the School of Artificial Intelligence and Information Technology, Nanjing University of Chinese Medicine, Nanjing 210023, China (e-mail: 380013@njucm.edu.cn). Yi Ding is with the College of Computing and Data Science, Nanyang Technological University, Singapore.


Explainable XR: Understanding User Behaviors of XR Environments using LLM-assisted Analytics Framework

arXiv.org Artificial Intelligence

We present Explainable XR, an end-to-end framework for analyzing user behavior in diverse eXtended Reality (XR) environments by leveraging Large Language Models (LLMs) for data interpretation assistance. Existing XR user analytics frameworks face challenges in handling transitions across virtualities (AR, VR, MR), multi-user collaborative application scenarios, and the complexity of multimodal data. Explainable XR addresses these challenges by providing a virtuality-agnostic solution for the collection, analysis, and visualization of immersive sessions. We propose three main components in our framework: (1) a novel user data recording schema, called User Action Descriptor (UAD), that can capture the users' multimodal actions, along with their intents and the contexts; (2) a platform-agnostic XR session recorder; and (3) a visual analytics interface that offers LLM-assisted insights tailored to the analysts' perspectives, facilitating the exploration and analysis of the recorded XR session data. We demonstrate the versatility of Explainable XR through five use-case scenarios, in both individual and collaborative XR applications across virtualities. Our technical evaluation and user studies show that Explainable XR provides a highly usable analytics solution for understanding user actions and delivering multifaceted, actionable insights into user behaviors in immersive environments.


SpatialCoT: Advancing Spatial Reasoning through Coordinate Alignment and Chain-of-Thought for Embodied Task Planning

arXiv.org Artificial Intelligence

Spatial reasoning is an essential problem in embodied AI research. Efforts to enhance spatial reasoning abilities through supplementary spatial data and fine-tuning have proven limited and ineffective when addressing complex embodied tasks, largely due to their dependence on language-based outputs. While some approaches have introduced a point-based action space to mitigate this issue, they fall short in managing more intricate tasks within complex environments. This deficiency arises from their failure to fully exploit the inherent thinking and reasoning capabilities that are fundamental strengths of Vision-Language Models (VLMs). To address these limitations, we propose a novel approach named SpatialCoT, specifically designed to bolster the spatial reasoning capabilities of VLMs. Our approach comprises two stages: spatial coordinate bi-directional alignment, which aligns vision-language inputs with spatial coordinates, and chain-of-thought spatial grounding, which harnesses the reasoning capabilities of language models for advanced spatial reasoning. We evaluate SpatialCoT on challenging navigation and manipulation tasks, both in simulation and real-world settings. Experimental results demonstrate that our method significantly outperforms previous state-of-the-art approaches in both tasks.


Spatially-Delineated Domain-Adapted AI Classification: An Application for Oncology Data

arXiv.org Artificial Intelligence

Given multi-type point maps from different place-types (e.g., tumor regions), our objective is to develop a classifier trained on the source place-type to accurately distinguish between two classes of the target place-type based on their point arrangements. This problem is societally important for many applications, such as generating clinical hypotheses for designing new immunotherapies for cancer treatment. The challenge lies in the spatial variability, the inherent heterogeneity and variation observed in spatial properties or arrangements across different locations (i.e., place-types). Previous techniques focus on self-supervised tasks to learn domain-invariant features and mitigate domain differences; however, they often neglect the underlying spatial arrangements among data points, leading to significant discrepancies across different place-types. We explore a novel multi-task self-learning framework that targets spatial arrangements, such as spatial mix-up masking and spatial contrastive predictive coding, for spatially-delineated domain-adapted AI classification. Experimental results on real-world datasets (e.g., oncology data) show that the proposed framework provides higher prediction accuracy than baseline methods.