Spatial Reasoning
Forest aboveground biomass estimation using GEDI and earth observation data through attention-based deep learning
Dong, Wenquan, Mitchard, Edward T. A., Yu, Hao, Hancock, Steven, Ryan, Casey M.
Accurate quantification of forest aboveground biomass (AGB) is critical for understanding carbon accounting in the context of climate change. In this study, we presented a novel attention-based deep learning approach for forest AGB estimation, primarily utilizing openly accessible EO data, including: GEDI LiDAR data, C-band Sentinel-1 SAR data, ALOS-2 PALSAR-2 data, and Sentinel-2 multispectral data. The attention UNet (AU) model achieved markedly higher accuracy for biomass estimation compared to the conventional RF algorithm. Specifically, the AU model attained an R2 of 0.66, RMSE of 43.66 Mg ha-1, and bias of 0.14 Mg ha-1, while RF resulted in lower scores of R2 0.62, RMSE 45.87 Mg ha-1, and bias 1.09 Mg ha-1. However, the superiority of the deep learning approach was not uniformly observed across all tested models. ResNet101 only achieved an R2 of 0.50, an RMSE of 52.93 Mg ha-1, and a bias of 0.99 Mg ha-1, while the UNet reported an R2 of 0.65, an RMSE of 44.28 Mg ha-1, and a substantial bias of 1.84 Mg ha-1. Moreover, to explore the performance of AU in the absence of spatial information, fully connected (FC) layers were employed to eliminate spatial information from the remote sensing data. AU-FC achieved intermediate R2 of 0.64, RMSE of 44.92 Mgha-1, and bias of -0.56 Mg ha-1, outperforming RF but underperforming AU model using spatial information. We also generated 10m forest AGB maps across Guangdong for the year 2019 using AU and compared it with that produced by RF. The AGB distributions from both models showed strong agreement with similar mean values; the mean forest AGB estimated by AU was 102.18 Mg ha-1 while that of RF was 104.84 Mg ha-1. Additionally, it was observed that the AGB map generated by AU provided superior spatial information. Overall, this research substantiates the feasibility of employing deep learning for biomass estimation based on satellite data.
MultiSPANS: A Multi-range Spatial-Temporal Transformer Network for Traffic Forecast via Structural Entropy Optimization
Zou, Dongcheng, Wang, Senzhang, Li, Xuefeng, Peng, Hao, Wang, Yuandong, Liu, Chunyang, Sheng, Kehua, Zhang, Bo
Traffic forecasting is a complex multivariate time-series regression task of paramount importance for traffic management and planning. However, existing approaches often struggle to model complex multi-range dependencies using local spatiotemporal features and road network hierarchical knowledge. To address this, we propose MultiSPANS. First, considering that an individual recording point cannot reflect critical spatiotemporal local patterns, we design multi-filter convolution modules for generating informative ST-token embeddings to facilitate attention computation. Then, based on ST-token and spatial-temporal position encoding, we employ the Transformers to capture long-range temporal and spatial dependencies. Furthermore, we introduce structural entropy theory to optimize the spatial attention mechanism. Specifically, The structural entropy minimization algorithm is used to generate optimal road network hierarchies, i.e., encoding trees. Based on this, we propose a relative structural entropy-based position encoding and a multi-head attention masking scheme based on multi-layer encoding trees. Extensive experiments demonstrate the superiority of the presented framework over several state-of-the-art methods in real-world traffic datasets, and the longer historical windows are effectively utilized. The code is available at https://github.com/SELGroup/MultiSPANS.
Spatial Process Approximations: Assessing Their Necessity
In spatial statistics and machine learning, the kernel matrix plays a pivotal role in prediction, classification, and maximum likelihood estimation. A thorough examination reveals that for large sample sizes, the kernel matrix becomes ill-conditioned, provided the sampling locations are fairly evenly distributed. This condition poses significant challenges to numerical algorithms used in prediction and estimation computations and necessitates an approximation to prediction and the Gaussian likelihood. A review of current methodologies for managing large spatial data indicates that some fail to address this ill-conditioning problem. Such ill-conditioning often results in low-rank approximations of the stochastic processes. This paper introduces various optimality criteria and provides solutions for each.
A Spatial-Temporal Transformer based Framework For Human Pose Assessment And Correction in Education Scenarios
Hu, Wenyang, Liu, Kai, Liu, Libin, Shang, Huiliang
Human pose assessment and correction play a crucial role in applications across various fields, including computer vision, robotics, sports analysis, healthcare, and entertainment. In this paper, we propose a Spatial-Temporal Transformer based Framework (STTF) for human pose assessment and correction in education scenarios such as physical exercises and science experiment. The framework comprising skeletal tracking, pose estimation, posture assessment, and posture correction modules to educate students with professional, quick-to-fix feedback. We also create a pose correction method to provide corrective feedback in the form of visual aids. We test the framework with our own dataset. It comprises (a) new recordings of five exercises, (b) existing recordings found on the internet of the same exercises, and (c) corrective feedback on the recordings by professional athletes and teachers. Results show that our model can effectively measure and comment on the quality of students' actions. The STTF leverages the power of transformer models to capture spatial and temporal dependencies in human poses, enabling accurate assessment and effective correction of students' movements.
RIR-SF: Room Impulse Response Based Spatial Feature for Multi-channel Multi-talker ASR
Shao, Yiwen, Zhang, Shi-Xiong, Yu, Dong
Multi-channel multi-talker automatic speech recognition (ASR) presents ongoing challenges within the speech community, particularly when confronted with significant reverberation effects. In this study, we introduce a novel approach involving the convolution of overlapping speech signals with the room impulse response (RIR) corresponding to the target speaker's transmission to a microphone array. This innovative technique yields a novel spatial feature known as the RIR-SF. Through a comprehensive comparison with the previously established state-of-the-art 3D spatial feature, both theoretical analysis and experimental results substantiate the superiority of our proposed RIR-SF. We demonstrate that the RIR-SF outperforms existing methods, leading to a remarkable 21.3\% relative reduction in the Character Error Rate (CER) in multi-channel multi-talker ASR systems. Importantly, this novel feature exhibits robustness in the face of strong reverberation, surpassing the limitations of previous approaches.
What's "up" with vision-language models? Investigating their struggle with spatial reasoning
Kamath, Amita, Hessel, Jack, Chang, Kai-Wei
Recent vision-language (VL) models are powerful, but can they reliably distinguish "right" from "left"? We curate three new corpora to quantify model comprehension of such basic spatial relations. These tests isolate spatial reasoning more precisely than existing datasets like VQAv2, e.g., our What'sUp benchmark contains sets of photographs varying only the spatial relations of objects, keeping their identity fixed (see Figure 1: models must comprehend not only the usual case of a dog under a table, but also, the same dog on top of the same table). We evaluate 18 VL models, finding that all perform poorly, e.g., BLIP finetuned on VQAv2, which nears human parity on VQAv2, achieves 56% accuracy on our benchmarks vs. humans at 99%. We conclude by studying causes of this surprising behavior, finding: 1) that popular vision-language pretraining corpora like LAION-2B contain little reliable data for learning spatial relationships; and 2) that basic modeling interventions like up-weighting preposition-containing instances or fine-tuning on our corpora are not sufficient to address the challenges our benchmarks pose. We are hopeful that these corpora will facilitate further research, and we release our data and code at https://github.com/amitakamath/whatsup_vlms.
DiffSpectralNet : Unveiling the Potential of Diffusion Models for Hyperspectral Image Classification
Sigger, Neetu, Nguyen, Tuan Thanh, Tozzi, Gianluca, Vien, Quoc-Tuan, Van Nguyen, Sinh
Hyperspectral images (HSI) have become popular for analysing remotely sensed images in multiple domain like agriculture, medical. However, existing models struggle with complex relationships and characteristics of spectral-spatial data due to the multi-band nature and data redundancy of hyperspectral data. To address this limitation, we propose a new network called DiffSpectralNet, which combines diffusion and transformer techniques. Our approach involves a two-step process. First, we use an unsupervised learning framework based on the diffusion model to extract both high-level and low-level spectral-spatial features. The diffusion method is capable of extracting diverse and meaningful spectral-spatial features, leading to improvement in HSI classification. Then, we employ a pretrained denoising U-Net to extract intermediate hierarchical features for classification. Finally, we use a supervised transformer-based classifier to perform the HSI classification. Through comprehensive experiments on HSI datasets, we evaluate the classification performance of DiffSpectralNet. The results demonstrate that our framework significantly outperforms existing approaches, achieving state-of-the-art performance.
Optimal Spatial-Temporal Triangulation for Bearing-Only Cooperative Motion Estimation
Zheng, Canlun, Mi, Yize, Guo, Hanqing, Chen, Huaben, Lin, Zhiyun, Zhao, Shiyu
Vision-based cooperative motion estimation is an important problem for many multi-robot systems such as cooperative aerial target pursuit. This problem can be formulated as bearing-only cooperative motion estimation, where the visual measurement is modeled as a bearing vector pointing from the camera to the target. The conventional approaches for bearing-only cooperative estimation are mainly based on the framework distributed Kalman filtering (DKF). In this paper, we propose a new optimal bearing-only cooperative estimation algorithm, named spatial-temporal triangulation, based on the method of distributed recursive least squares, which provides a more flexible framework for designing distributed estimators than DKF. The design of the algorithm fully incorporates all the available information and the specific triangulation geometric constraint. As a result, the algorithm has superior estimation performance than the state-of-the-art DKF algorithms in terms of both accuracy and convergence speed as verified by numerical simulation. We rigorously prove the exponential convergence of the proposed algorithm. Moreover, to verify the effectiveness of the proposed algorithm under practical challenging conditions, we develop a vision-based cooperative aerial target pursuit system, which is the first of such fully autonomous systems so far to the best of our knowledge.
Video-FocalNets: Spatio-Temporal Focal Modulation for Video Action Recognition
Wasim, Syed Talal, Khattak, Muhammad Uzair, Naseer, Muzammal, Khan, Salman, Shah, Mubarak, Khan, Fahad Shahbaz
Recent video recognition models utilize Transformer models for long-range spatio-temporal context modeling. Video transformer designs are based on self-attention that can model global context at a high computational cost. In comparison, convolutional designs for videos offer an efficient alternative but lack long-range dependency modeling. Towards achieving the best of both designs, this work proposes Video-FocalNet, an effective and efficient architecture for video recognition that models both local and global contexts. Video-FocalNet is based on a spatio-temporal focal modulation architecture that reverses the interaction and aggregation steps of self-attention for better efficiency. Further, the aggregation step and the interaction step are both implemented using efficient convolution and element-wise multiplication operations that are computationally less expensive than their self-attention counterparts on video representations. We extensively explore the design space of focal modulation-based spatio-temporal context modeling and demonstrate our parallel spatial and temporal encoding design to be the optimal choice. Video-FocalNets perform favorably well against the state-of-the-art transformer-based models for video recognition on five large-scale datasets (Kinetics-400, Kinetics-600, SS-v2, Diving-48, and ActivityNet-1.3) at a lower computational cost. Our code/models are released at https://github.com/TalalWasim/Video-FocalNets.
Where you go is who you are -- A study on machine learning based semantic privacy attacks
Wiedemann, Nina, Kounadi, Ourania, Raubal, Martin, Janowicz, Krzysztof
Concerns about data privacy are omnipresent, given the increasing usage of digital applications and their underlying business model that includes selling user data. Location data is particularly sensitive since they allow us to infer activity patterns and interests of users, e.g., by categorizing visited locations based on nearby points of interest (POI). On top of that, machine learning methods provide new powerful tools to interpret big data. In light of these considerations, we raise the following question: What is the actual risk that realistic, machine learning based privacy attacks can obtain meaningful semantic information from raw location data, subject to inaccuracies in the data? In response, we present a systematic analysis of two attack scenarios, namely location categorization and user profiling. Experiments on the Foursquare dataset and tracking data demonstrate the potential for abuse of high-quality spatial information, leading to a significant privacy loss even with location inaccuracy of up to 200m. With location obfuscation of more than 1 km, spatial information hardly adds any value, but a high privacy risk solely from temporal information remains. The availability of public context data such as POIs plays a key role in inference based on spatial information. Our findings point out the risks of ever-growing databases of tracking data and spatial context data, which policymakers should consider for privacy regulations, and which could guide individuals in their personal location protection measures.