Geophysical Analysis & Survey
CARE: Confidence-Aware Regression Estimation of building density fine-tuning EO Foundation Models
Dionelis, Nikolaos, Bosmans, Jente, Longépé, Nicolas
--Performing accurate confidence quantification and assessment in pixel-wise regression tasks, which are downstream applications of AI Foundation Models for Earth Observation (EO), is important for deep neural networks to predict their failures, improve their performance and enhance their capabilities in real-world applications, for their practical deployment. For pixel-wise regression tasks, specifically utilizing remote sensing data from satellite imagery in EO Foundation Models, confidence quantification is a critical challenge. The focus of this research is on developing a Foundation Model using EO satellite data that computes and assigns a confidence metric alongside regression outputs to improve the reliability and interpretability of predictions generated by deep neural networks. T o this end, we develop, train and evaluate the proposed Confidence-A ware Regression Estimation (CARE) Foundation Model. Our model CARE computes and assigns confidence to regression results as downstream tasks of a Foundation Model for EO data, and performs a confidence-aware self-corrective learning method for the low-confidence regions. We evaluate the model CARE, and experimental results on multi-spectral data from the Copernicus Sentinel-2 constellation to estimate the building density (i.e. We also show that our model CARE outperforms other methods. The significance of confidence quantification and assessment in deep learning, specifically in AI Foundation Models in Earth Observation (EO) that use satellite data, for regression applications is critical. The utility of satellite data seems inexhaustible, and thanks to developments in AI, applications emerge at an accelerated pace in EO Foundation Models using remote sensing data.
Remote Sensing Semantic Segmentation Quality Assessment based on Vision Language Model
Shi, Huiying, Tan, Zhihong, Zhang, Zhihan, Wei, Hongchen, Hu, Yaosi, Zhang, Yingxue, Chen, Zhenzhong
The complexity of scenes and variations in image quality result in significant variability in the performance of semantic segmentation methods of remote sensing imagery (RSI) in supervised real-world scenarios. This makes the evaluation of semantic segmentation quality in such scenarios an issue to be resolved. However, most of the existing evaluation metrics are developed based on expert-labeled object-level annotations, which are not applicable in such scenarios. To address this issue, we propose RS-SQA, an unsupervised quality assessment model for RSI semantic segmentation based on vision language model (VLM). This framework leverages a pre-trained RS VLM for semantic understanding and utilizes intermediate features from segmentation methods to extract implicit information about segmentation quality. Specifically, we introduce CLIP-RS, a large-scale pre-trained VLM trained with purified text to reduce textual noise and capture robust semantic information in the RS domain. Feature visualizations confirm that CLIP-RS can effectively differentiate between various levels of segmentation quality. Semantic features and low-level segmentation features are effectively integrated through a semantic-guided approach to enhance evaluation accuracy. To further support the development of RS semantic segmentation quality assessment, we present RS-SQED, a dedicated dataset sampled from four major RS semantic segmentation datasets and annotated with segmentation accuracy derived from the inference results of 8 representative segmentation methods. Experimental results on the established dataset demonstrate that RS-SQA significantly outperforms state-of-the-art quality assessment models. This provides essential support for predicting segmentation accuracy and high-quality semantic segmentation interpretation, offering substantial practical value.
JL1-CD: A New Benchmark for Remote Sensing Change Detection and a Robust Multi-Teacher Knowledge Distillation Framework
Liu, Ziyuan, Zhu, Ruifei, Gao, Long, Zhou, Yuanxiu, Ma, Jingyu, Gu, Yuantao
Xu et al. [24] introduce a semi-supervised label and embedding consistency network (SS-LEC) for ORSI scene classification, which strategically enforces consistency across augmentations and stages of training. Li et al. [25] propose SemiCD-VL, a VLM-guided semi-supervised change detection method that synthesizes pseudo labels via a mixed change event generation strategy, achieving significant performance gains over FixMatch and SOT A unsupervised methods. However, DL-based CD methods generally face two major challenges: the scarcity of high-quality, high-resolution, all-inclusive CD datasets and limitations in handling highly dynamic change areas. Although numerous CD datasets have been constructed and proposed, they are often tailored to specific scenarios, which restricts the generalization capabilities of the algorithms. For instance, models trained on datasets focused on human-induced changes often fail to perform effectively when confronted with natural change scenarios.
SARChat-Bench-2M: A Multi-Task Vision-Language Benchmark for SAR Image Interpretation
Ma, Zhiming, Xiao, Xiayang, Dong, Sihao, Wang, Peidong, Wang, HaiPeng, Pan, Qingyun
As a powerful all-weather Earth observation tool, synthetic aperture radar (SAR) remote sensing enables critical military reconnaissance, maritime surveillance, and infrastructure monitoring. Although Vision language models (VLMs) have made remarkable progress in natural language processing and image understanding, their applications remain limited in professional domains due to insufficient domain expertise. This paper innovatively proposes the first large-scale multimodal dialogue dataset for SAR images, named SARChat-2M, which contains approximately 2 million high-quality image-text pairs, encompasses diverse scenarios with detailed target annotations. This dataset not only supports several key tasks such as visual understanding and object detection tasks, but also has unique innovative aspects: this study develop a visual-language dataset and benchmark for the SAR domain, enabling and evaluating VLMs' capabilities in SAR image interpretation, which provides a paradigmatic framework for constructing multimodal datasets across various remote sensing vertical domains. Through experiments on 16 mainstream VLMs, the effectiveness of the dataset has been fully verified. The project will be released at https://github.com/JimmyMa99/SARChat.
DiSciPLE: Learning Interpretable Programs for Scientific Visual Discovery
Mall, Utkarsh, Phoo, Cheng Perng, Chiquier, Mia, Hariharan, Bharath, Bala, Kavita, Vondrick, Carl
Visual data is used in numerous different scientific workflows ranging from remote sensing to ecology. As the amount of observation data increases, the challenge is not just to make accurate predictions but also to understand the underlying mechanisms for those predictions. Good interpretation is important in scientific workflows, as it allows for better decision-making by providing insights into the data. This paper introduces an automatic way of obtaining such interpretable-by-design models, by learning programs that interleave neural networks. We propose DiSciPLE (Discovering Scientific Programs using LLMs and Evolution) an evolutionary algorithm that leverages common sense and prior knowledge of large language models (LLMs) to create Python programs explaining visual data. Additionally, we propose two improvements: a program critic and a program simplifier to improve our method further to synthesize good programs. On three different real-world problems, DiSciPLE learns state-of-the-art programs on novel tasks with no prior literature. For example, we can learn programs with 35% lower error than the closest non-interpretable baseline for population density estimation.
FE-LWS: Refined Image-Text Representations via Decoder Stacking and Fused Encodings for Remote Sensing Image Captioning
Remote sensing image captioning aims to generate descriptive text from remote sensing images, typically employing an encoder-decoder framework. In this setup, a convolutional neural network (CNN) extracts feature representations from the input image, which then guide the decoder in a sequence-to-sequence caption generation process. Although much research has focused on refining the decoder, the quality of image representations from the encoder remains crucial for accurate captioning. This paper introduces a novel approach that integrates features from two distinct CNN based encoders, capturing complementary information to enhance caption generation. Additionally, we propose a weighted averaging technique to combine the outputs of all GRUs in the stacked decoder. Furthermore, a comparison-based beam search strategy is incorporated to refine caption selection. The results demonstrate that our fusion-based approach, along with the enhanced stacked decoder, significantly outperforms both the transformer-based state-of-the-art model and other LSTM-based baselines.
Wholly-WOOD: Wholly Leveraging Diversified-quality Labels for Weakly-supervised Oriented Object Detection
Yu, Yi, Yang, Xue, Li, Yansheng, Han, Zhenjun, Da, Feipeng, Yan, Junchi
Accurately estimating the orientation of visual objects with compact rotated bounding boxes (RBoxes) has become a prominent demand, which challenges existing object detection paradigms that only use horizontal bounding boxes (HBoxes). To equip the detectors with orientation awareness, supervised regression/classification modules have been introduced at the high cost of rotation annotation. Meanwhile, some existing datasets with oriented objects are already annotated with horizontal boxes or even single points. It becomes attractive yet remains open for effectively utilizing weaker single point and horizontal annotations to train an oriented object detector (OOD). We develop Wholly-WOOD, a weakly-supervised OOD framework, capable of wholly leveraging various labeling forms (Points, HBoxes, RBoxes, and their combination) in a unified fashion. By only using HBox for training, our Wholly-WOOD achieves performance very close to that of the RBox-trained counterpart on remote sensing and other areas, significantly reducing the tedious efforts on labor-intensive annotation for oriented objects. The source codes are available at https://github.com/VisionXLab/whollywood (PyTorch-based) and https://github.com/VisionXLab/whollywood-jittor (Jittor-based).
Illegal Waste Detection in Remote Sensing Images: A Case Study
Gibellini, Federico, Fraternali, Piero, Boracchi, Giacomo, Morandini, Luca, Diecidue, Andrea, Malegori, Simona
Environmental crime currently represents the third largest criminal activity worldwide while threatening ecosystems as well as human health. Among the crimes related to this activity, improper waste management can nowadays be countered more easily thanks to the increasing availability and decreasing cost of Very-High-Resolution Remote Sensing images, which enable semi-automatic territory scanning in search of illegal landfills. This paper proposes a pipeline, developed in collaboration with professionals from a local environmental agency, for detecting candidate illegal dumping sites leveraging a classifier of Remote Sensing images. To identify the best configuration for such classifier, an extensive set of experiments was conducted and the impact of diverse image characteristics and training settings was thoroughly analyzed. The local environmental agency was then involved in an experimental exercise where outputs from the developed classifier were integrated in the experts' everyday work, resulting in time savings with respect to manual photo-interpretation. The classifier was eventually run with valuable results on a location outside of the training area, highlighting potential for cross-border applicability of the proposed pipeline.
Multi-modal Data Fusion and Deep Ensemble Learning for Accurate Crop Yield Prediction
Yewle, Akshay Dagadu, Mirzayeva, Laman, Karakuş, Oktay
This study introduces RicEns-Net, a novel Deep Ensemble model designed to predict crop yields by integrating diverse data sources through multimodal data fusion techniques. The research focuses specifically on the use of synthetic aperture radar (SAR), optical remote sensing data from Sentinel 1, 2, and 3 satellites, and meteorological measurements such as surface temperature and rainfall. The initial field data for the study were acquired through Ernst & Young's (EY) Open Science Challenge 2023. The primary objective is to enhance the precision of crop yield prediction by developing a machine-learning framework capable of handling complex environmental data. A comprehensive data engineering process was employed to select the most informative features from over 100 potential predictors, reducing the set to 15 features from 5 distinct modalities. This step mitigates the ``curse of dimensionality" and enhances model performance. The RicEns-Net architecture combines multiple machine learning algorithms in a deep ensemble framework, integrating the strengths of each technique to improve predictive accuracy. Experimental results demonstrate that RicEns-Net achieves a mean absolute error (MAE) of 341 kg/Ha (roughly corresponds to 5-6\% of the lowest average yield in the region), significantly exceeding the performance of previous state-of-the-art models, including those developed during the EY challenge.
An Empirical Study of Methods for Small Object Detection from Satellite Imagery
Yuan, Xiaohui, Chakravarty, Aniv, Gu, Lichuan, Wei, Zhenchun, Lichtenberg, Elinor, Chen, Tian
This paper reviews object detection methods for finding small objects from remote sensing imagery and provides an empirical evaluation of four state-of-the-art methods to gain insights into method performance and technical challenges. In particular, we use car detection from urban satellite images and bee box detection from satellite images of agricultural lands as application scenarios. Drawing from the existing surveys and literature, we identify several top-performing methods for the empirical study. Public, high-resolution satellite image datasets are used in our experiments.