
Collaborating Authors: Irvin, Jeremy Andrew


TEOChat: A Large Vision-Language Assistant for Temporal Earth Observation Data

arXiv.org Artificial Intelligence

Large vision and language assistants have enabled new capabilities for interpreting natural images. These approaches have recently been adapted to earth observation data, but they can only handle single-image inputs, limiting their use for many real-world tasks. In this work, we develop a new vision and language assistant called TEOChat that can engage in conversations about temporal sequences of earth observation data. To train TEOChat, we curate an instruction-following dataset composed of many single-image and temporal tasks, including building change and damage assessment, semantic change detection, and temporal scene classification. We show that TEOChat can perform a wide variety of spatial and temporal reasoning tasks, substantially outperforming previous vision and language assistants, and even achieving comparable or better performance than specialist models trained to perform these specific tasks. Furthermore, TEOChat achieves impressive zero-shot performance on a change detection and change question answering dataset, outperforms GPT-4o and Gemini 1.5 Pro on multiple temporal tasks, and exhibits stronger single-image capabilities than a comparable single EO image instruction-following model.

Many earth observation (EO) tasks require the ability to reason over time. For example, change detection is a widely studied task where the goal is to identify salient changes in a region using multiple EO images capturing the region at different times (Chughtai et al., 2021; Bai et al., 2023; Cheng et al., 2023). Previous methods for automatically detecting change in EO imagery have been specialist models, constraining their use to the single task or small set of tasks they were explicitly trained to perform (Bai et al., 2023; Cheng et al., 2023). Advances in multimodal modeling have enabled generalist vision-language models (VLMs) that can perform a variety of natural image interpretation tasks specified flexibly through natural language (Achiam et al., 2023; Team et al., 2023; Liu et al., 2023). However, no prior VLMs can model temporal EO data (left of Figure 1), notably including change detection tasks. We investigate the performance of Video-LLaVA (Lin et al., 2023), a strong natural image pre-trained VLM that can receive images and videos as input, and GeoChat (Kuckreja et al., 2023), a strong VLM fine-tuned on single EO image tasks (right of Figure 1). We find that Video-LLaVA generates inaccurate information, likely because it has primarily been trained on natural images and videos, whereas GeoChat can only take single images as input and cannot process information across time.

Figure 1: TEOChat is the first VLM to model temporal earth observation (EO) data. We compare a temporal VLM (Video-LLaVA; Lin et al., 2023) and an EO VLM (GeoChat; Kuckreja et al., 2023) with TEOChat.
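To make the notion of a temporal instruction-following example concrete, the sketch below shows one plausible way such an example could be represented: a chronologically ordered EO image sequence paired with a multi-turn conversation. All field names and values here are illustrative assumptions, not the actual TEOChat dataset schema.

```python
# A minimal, hypothetical sketch of one temporal instruction-following
# example for a model like TEOChat. Field names and contents are
# illustrative assumptions, not the paper's actual data format.
from dataclasses import dataclass, field


@dataclass
class TemporalEOExample:
    """One instruction-following example over a time series of EO images."""
    image_paths: list[str]   # chronologically ordered EO images of one region
    timestamps: list[str]    # acquisition dates, aligned with image_paths
    conversation: list[dict] = field(default_factory=list)


example = TemporalEOExample(
    image_paths=["region42_2019.png", "region42_2021.png"],
    timestamps=["2019-06-01", "2021-06-01"],
    conversation=[
        {"role": "user",
         "content": "Identify any buildings damaged between image 1 and image 2."},
        {"role": "assistant",
         "content": "Two buildings in the northeast quadrant show collapsed roofs in image 2."},
    ],
)

# A training pipeline would interleave the encoded image sequence with the
# tokenized conversation, so the model learns to ground its answers in
# specific images of the sequence.
print(len(example.image_paths), "images,", len(example.conversation), "turns")
```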


GEO-Bench: Toward Foundation Models for Earth Monitoring

arXiv.org Artificial Intelligence

Recent progress in self-supervision has shown that pre-training large neural networks on vast amounts of unlabeled data can lead to substantial gains in generalization to downstream tasks. Such models, recently coined foundation models, have been transformational to the field of natural language processing. Variants have also been proposed for image data, but their applicability to remote sensing tasks is limited. To stimulate the development of foundation models for Earth monitoring, we propose a benchmark comprising six classification and six segmentation tasks, carefully curated and adapted to be both relevant to the field and well-suited for model evaluation. We accompany this benchmark with a robust methodology for evaluating models and reporting aggregated results to enable a reliable assessment of progress. Finally, we report results for 20 baselines to characterize the performance of existing models. We believe that this benchmark will be a driver of progress across a variety of Earth monitoring tasks.
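Because the twelve tasks use heterogeneous metrics (e.g., accuracy for classification, IoU for segmentation), reporting one aggregated number requires normalizing per-task scores before averaging. The sketch below illustrates one common approach for such benchmarks: min-max normalization of each task's scores across models, followed by a bootstrapped mean with an error estimate. The function names and toy scores are assumptions for illustration, not necessarily the paper's exact protocol or data.

```python
# A minimal sketch of aggregating heterogeneous benchmark metrics.
# Normalization scheme and toy values are illustrative assumptions.
import random


def normalize(scores: dict[str, float]) -> dict[str, float]:
    """Min-max scale one task's per-model scores to [0, 1]."""
    lo, hi = min(scores.values()), max(scores.values())
    span = (hi - lo) or 1.0  # guard against identical scores
    return {m: (s - lo) / span for m, s in scores.items()}


def aggregate(per_task: dict[str, dict[str, float]], model: str,
              n_boot: int = 1000, seed: int = 0) -> tuple[float, float]:
    """Bootstrap a model's mean normalized score over tasks; return (mean, stderr)."""
    normed = [normalize(scores)[model] for scores in per_task.values()]
    rng = random.Random(seed)
    means = [sum(rng.choices(normed, k=len(normed))) / len(normed)
             for _ in range(n_boot)]
    mu = sum(means) / n_boot
    var = sum((m - mu) ** 2 for m in means) / (n_boot - 1)
    return mu, var ** 0.5


# Toy per-task scores (accuracy for classification, mIoU for segmentation);
# values are made up for illustration.
per_task = {
    "task_cls_1": {"resnet50": 0.81, "vit_b": 0.85},
    "task_seg_1": {"resnet50": 0.55, "vit_b": 0.61},
}
print(aggregate(per_task, "vit_b"))
```

Normalizing before averaging prevents tasks with naturally larger metric ranges from dominating the aggregate, and the bootstrap gives a simple uncertainty estimate alongside the point score.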