 Spatial Reasoning


LOViS: Learning Orientation and Visual Signals for Vision and Language Navigation

arXiv.org Artificial Intelligence

Understanding spatial and visual information is essential for a navigation agent that follows natural language instructions. Current Transformer-based VLN agents entangle orientation and vision information, which limits the gain from learning each information source. In this paper, we design a neural agent with explicit Orientation and Vision modules. These modules learn to ground spatial information and landmark mentions in the instructions to the visual environment more effectively. To strengthen the spatial reasoning and visual perception of the agent, we design specific pre-training tasks to feed and better utilize the corresponding modules in our final navigation model. We evaluate our approach on both the Room2room (R2R) and Room4room (R4R) datasets and achieve state-of-the-art results on both benchmarks.


Spatial-Temporal Interactive Dynamic Graph Convolution Network for Traffic Forecasting

arXiv.org Artificial Intelligence

Accurate traffic forecasting is essential for smart cities to achieve traffic control, route planning, and flow detection. Although many spatial-temporal methods have been proposed, they are deficient in capturing the spatial-temporal dependence of traffic data synchronously. In addition, most of these methods ignore the dynamically changing correlations between road network nodes that arise as traffic data changes. We propose a neural network-based Spatial-Temporal Interactive Dynamic Graph Convolutional Network (STIDGCN) to address these challenges for traffic forecasting. Specifically, we propose an interactive dynamic graph convolution structure, which divides the sequences at intervals and synchronously captures the traffic data's spatial-temporal dependence through an interactive learning strategy. The interactive learning strategy makes STIDGCN effective for long-term prediction. We also propose a novel dynamic graph convolution module, consisting of a graph generator and fusion graph convolution, to capture the dynamically changing correlations in the traffic network. The dynamic graph convolution module can use the input traffic data and pre-defined graph structure to generate a graph structure. It is then fused with the defined adaptive adjacency matrix to generate a dynamic adjacency matrix, which fills the pre-defined graph structure and simulates the generation of dynamic associations between nodes in the road network. Extensive experiments on four real-world traffic flow datasets demonstrate that STIDGCN outperforms the state-of-the-art baselines.


Spatial-temporal Transformers for EEG Emotion Recognition

arXiv.org Artificial Intelligence

Electroencephalography (EEG) is a popular and effective tool for emotion recognition. However, the propagation mechanisms of EEG in the human brain and its intrinsic correlation with emotions are still obscure to researchers. This work proposes four variant transformer frameworks (spatial attention, temporal attention, sequential spatial-temporal attention and simultaneous spatial-temporal attention) for EEG emotion recognition to explore the relationship between emotion and spatial-temporal EEG features. Specifically, spatial attention and temporal attention learn the topological structure information and time-varying EEG characteristics for emotion recognition, respectively. Sequential spatial-temporal attention applies spatial attention within a one-second segment and then temporal attention within one sample, to explore the degree to which emotional stimulation influences the EEG signals of diverse electrodes in the same temporal segment. The simultaneous spatial-temporal attention, whose spatial and temporal attention are performed simultaneously, is used to model the relationship between different spatial features in different time segments. The experimental results demonstrate that simultaneous spatial-temporal attention leads to the best emotion recognition accuracy among the design choices, indicating that modeling the correlation of spatial and temporal features of EEG signals is significant to emotion recognition.


Spatial-Temporal Deep Embedding for Vehicle Trajectory Reconstruction from High-Angle Video

arXiv.org Artificial Intelligence

Spatial-temporal Map (STMap)-based methods have shown great potential to process high-angle videos for vehicle trajectory reconstruction, which can meet the needs of various data-driven modeling and imitation learning applications. In this paper, we develop a Spatial-Temporal Deep Embedding (STDE) model that imposes parity constraints at both pixel and instance levels to generate instance-aware embeddings for vehicle stripe segmentation on STMap. At the pixel level, each pixel is encoded with its 8-neighbor pixels at different ranges, and this encoding is subsequently used to guide a neural network to learn the embedding mechanism. At the instance level, a discriminative loss function is designed to pull pixels belonging to the same instance closer and push the mean values of different instances far apart in the embedding space. The output of the spatial-temporal affinity is then optimized by the mutex-watershed algorithm to obtain final clustering results. Based on segmentation metrics, our model outperforms five other baselines that have been used for STMap processing and shows robustness under the influence of shadows, static noises, and overlapping. The designed model is applied to process all public NGSIM US-101 videos to generate complete vehicle trajectories, indicating good scalability and adaptability. Finally, the strengths of the scanline method with STDE and future directions are discussed. Code, STMap dataset and video trajectory are made publicly available in the online repository. GitHub Link: shorturl.at/jklT0.


SeqOT: A Spatial-Temporal Transformer Network for Place Recognition Using Sequential LiDAR Data

arXiv.org Artificial Intelligence

Place recognition is an important component for autonomous vehicles to achieve loop closing or global localization. In this paper, we tackle the problem of place recognition based on sequential 3D LiDAR scans obtained by an onboard LiDAR sensor. We propose a transformer-based network named SeqOT to exploit the temporal and spatial information provided by sequential range images generated from the LiDAR data. It uses multi-scale transformers to generate a global descriptor for each sequence of LiDAR range images in an end-to-end fashion. During online operation, our SeqOT finds similar places by matching such descriptors between the current query sequence and those stored in the map. We evaluate our approach on four datasets collected with different types of LiDAR sensors in different environments. The experimental results show that our method outperforms the state-of-the-art LiDAR-based place recognition methods and generalizes well across different environments. Furthermore, our method operates online faster than the frame rate of the sensor. The implementation of our method is released as open source at: https://github.com/BIT-MJY/SeqOT.
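
The matching step described above, comparing the descriptor of the current query sequence against descriptors stored in the map, can be sketched as a nearest-neighbour search. This is only an illustration: the abstract does not state the distance measure or search structure, so Euclidean distance and a linear scan are assumptions here, and `best_match` and the toy descriptors are hypothetical names and values, not part of SeqOT.

```python
import math
from typing import Sequence, Tuple

Descriptor = Sequence[float]


def best_match(query: Descriptor, database: Sequence[Descriptor]) -> Tuple[int, float]:
    """Return (index, distance) of the stored descriptor closest to the query.

    Assumption: Euclidean distance with a linear scan. Returns (-1, inf)
    when the database is empty.
    """
    best_index, best_distance = -1, math.inf
    for index, stored in enumerate(database):
        distance = math.dist(query, stored)  # Euclidean distance (Python 3.8+)
        if distance < best_distance:
            best_index, best_distance = index, distance
    return best_index, best_distance


# Toy map with three stored sequence descriptors (hypothetical values).
place_map = [(1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (0.0, 0.0, 1.0)]
match_index, match_distance = best_match((0.1, 0.9, 0.0), place_map)
```

In practice a distance threshold would decide whether the best match counts as a revisited place; that threshold is not given in the abstract.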


SORNet: Spatial Object-Centric Representations for Sequential Manipulation

arXiv.org Artificial Intelligence

Sequential manipulation tasks require a robot to perceive the state of an environment and plan a sequence of actions leading to a desired goal state. In such tasks, the ability to reason about spatial relations among object entities from raw sensor inputs is crucial in order to determine when a task has been completed and which actions can be executed. In this work, we propose SORNet (Spatial Object-Centric Representation Network), a framework for learning object-centric representations from RGB images conditioned on a set of object queries, represented as image patches called canonical object views. With only a single canonical view per object and no annotation, SORNet generalizes zero-shot to object entities whose shape and texture are both unseen during training. We evaluate SORNet on various spatial reasoning tasks such as spatial relation classification and relative direction regression in complex tabletop manipulation scenarios and show that SORNet significantly outperforms baselines including state-of-the-art representation learning techniques. We also demonstrate the application of the representation learned by SORNet on visual-servoing and task planning for sequential manipulation on a real robot.


A topological analysis of cointegrated data: a Z24 Bridge case study

arXiv.org Artificial Intelligence

The paper studies the topological changes before and after cointegration for the natural frequencies of the Z24 Bridge. The second natural frequency is known to be nonlinear in temperature, and this serves as the main focal point of this work. Cointegration is a method of normalising time series data with respect to one another, often strongly-correlated time series. Cointegration is used in this paper to remove effects from Environmental and Operational Variations, by cointegrating the first four natural frequencies for the Z24 Bridge data. The temperature effects on the natural frequency data are clearly visible within the data, and it is desirable, for the purposes of structural health monitoring, that these effects are removed. The univariate time series are embedded in higher-dimensional space, such that interesting topologies are formed. Topological data analysis is used to analyse the raw time series and their cointegrated equivalents. A standard topological data analysis pipeline is enacted, where simplicial complexes are constructed from the embedded point clouds. Topological properties, such as the persistent homology, are then calculated from the simplicial complexes. The persistent homology is then analysed to determine the topological structure of all the time series.
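
As an illustration of how a univariate time series can be turned into a point cloud in a higher-dimensional space, the sketch below uses a time-delay embedding. The abstract does not name the embedding it uses, so the time-delay construction, the function name `delay_embed`, and the parameters `dim` and `lag` are all assumptions made for this example.

```python
from typing import List, Sequence, Tuple


def delay_embed(series: Sequence[float], dim: int, lag: int) -> List[Tuple[float, ...]]:
    """Map a series x to points (x[t], x[t + lag], ..., x[t + (dim - 1) * lag]).

    Assumption: a plain time-delay embedding; returns an empty list when
    the series is too short to form a single point.
    """
    if dim < 1 or lag < 1:
        raise ValueError("dim and lag must be positive integers")
    n_points = len(series) - (dim - 1) * lag
    if n_points <= 0:
        return []
    return [tuple(series[t + k * lag] for k in range(dim)) for t in range(n_points)]


# Five samples embedded in three dimensions with unit lag give three points.
cloud = delay_embed([0.0, 1.0, 2.0, 3.0, 4.0], dim=3, lag=1)
# cloud == [(0.0, 1.0, 2.0), (1.0, 2.0, 3.0), (2.0, 3.0, 4.0)]
```

The resulting point cloud is the kind of input from which simplicial complexes and persistent homology would then be computed with a dedicated topological data analysis library.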


Cross-Subject Domain Adaptation for Classifying Working Memory Load with Multi-Frame EEG Images

arXiv.org Artificial Intelligence

Working memory (WM), denoting the information temporarily stored in the mind, is a fundamental research topic in the field of human cognition. Electroencephalography (EEG), which can monitor the electrical activity of the brain, has been widely used in measuring the level of WM. However, one of the critical challenges is that individual differences may cause ineffective results, especially when the established model meets an unfamiliar subject. In this work, we propose a cross-subject deep adaptation model with spatial attention (CS-DASA) to generalize the workload classifications across subjects. First, we transform EEG time series into multi-frame EEG images incorporating spatial, spectral, and temporal information. Next, the Subject-Shared module in CS-DASA receives multi-frame EEG image data from both source and target subjects and learns the common feature representations. Then, in the subject-specific module, the maximum mean discrepancy is implemented to measure the domain distribution divergence in a reproducing kernel Hilbert space, which can add an effective penalty loss for domain adaptation. Additionally, the subject-to-subject spatial attention mechanism is employed to focus on the discriminative spatial features from the target image data. Experiments conducted on a public WM EEG dataset containing 13 subjects show that the proposed model is capable of achieving better performance than existing state-of-the-art methods.
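
The maximum mean discrepancy mentioned above can be sketched as follows. This is a minimal example of the standard biased estimate of squared MMD between two sample sets, not the CS-DASA implementation: the Gaussian (RBF) kernel, the bandwidth parameter `gamma`, and the toy feature vectors are assumptions, since the abstract specifies only that the divergence is measured in a reproducing kernel Hilbert space.

```python
import math
from typing import Sequence

Vector = Sequence[float]


def rbf_kernel(a: Vector, b: Vector, gamma: float) -> float:
    """Gaussian kernel exp(-gamma * ||a - b||^2) (assumed kernel choice)."""
    squared_distance = sum((ai - bi) ** 2 for ai, bi in zip(a, b))
    return math.exp(-gamma * squared_distance)


def mmd_squared(source: Sequence[Vector], target: Sequence[Vector], gamma: float = 1.0) -> float:
    """Biased estimate of squared MMD: E[k(s,s')] + E[k(t,t')] - 2 E[k(s,t)]."""

    def mean_kernel(xs: Sequence[Vector], ys: Sequence[Vector]) -> float:
        total = sum(rbf_kernel(x, y, gamma) for x in xs for y in ys)
        return total / (len(xs) * len(ys))

    return (
        mean_kernel(source, source)
        + mean_kernel(target, target)
        - 2.0 * mean_kernel(source, target)
    )


# Hypothetical feature vectors from a source subject and a target subject.
source_features = [(0.0, 0.0), (1.0, 0.0)]
target_features = [(5.0, 5.0), (6.0, 5.0)]
penalty = mmd_squared(source_features, target_features)
```

Used as a penalty term, this value shrinks toward zero as the source and target feature distributions become more similar, which is what makes it suitable as a domain adaptation loss.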


PTSEFormer: Progressive Temporal-Spatial Enhanced TransFormer Towards Video Object Detection

arXiv.org Artificial Intelligence

Recent years have witnessed a trend of using context frames to boost the performance of object detection in video, a task known as video object detection. Existing methods usually aggregate features in a single step to enhance the target feature. These methods, however, usually lack spatial information from neighboring frames and suffer from insufficient feature aggregation. To address these issues, we progressively introduce both temporal information and spatial information for an integrated enhancement. The temporal information is introduced by the temporal feature aggregation model (TFAM), which applies an attention mechanism between the context frames and the target frame (i.e., the frame to be detected). Meanwhile, we employ a Spatial Transition Awareness Model (STAM) to convey the location transition information between each context frame and the target frame. Built upon the transformer-based detector DETR, our PTSEFormer also follows an end-to-end fashion to avoid heavy post-processing procedures while achieving 88.1% mAP on the ImageNet VID dataset. Code is available at https://github.com/Hon-Wong/PTSEFormer.


Spatial motion planning with Pythagorean Hodograph curves

arXiv.org Artificial Intelligence

This paper presents a two-stage prediction-based control scheme for embedding the environment's geometric properties into a collision-free Pythagorean Hodograph spline, and subsequently finding the optimal path within the parameterized free space. The ingredients of this approach are twofold: First, we present a novel spatial path parameterization applicable to an arbitrary curve without prior assumptions on its adapted frame. Second, we identify the appropriateness of Pythagorean Hodograph curves for a compact and continuous definition of the path-parametric functions required by the presented spatial model. This dual-stage formulation results in a motion planning approach in which the geometric properties of the environment arise as states of the prediction model. Thus, the presented method is attractive for motion planning in dense environments. The efficacy of the approach is evaluated on an illustrative example.