TEOChat: A Large Vision-Language Assistant for Temporal Earth Observation Data
Irvin, Jeremy Andrew, Liu, Emily Ruoyu, Chen, Joyce Chuyi, Dormoy, Ines, Kim, Jinyoung, Khanna, Samar, Zheng, Zhuo, Ermon, Stefano
arXiv.org Artificial Intelligence
Large vision and language assistants have enabled new capabilities for interpreting natural images. These approaches have recently been adapted to earth observation data, but they are only able to handle single image inputs, limiting their use for many real-world tasks. In this work, we develop a new vision and language assistant called TEOChat that can engage in conversations about temporal sequences of earth observation data. To train TEOChat, we curate an instruction-following dataset composed of many single image and temporal tasks, including building change and damage assessment, semantic change detection, and temporal scene classification. We show that TEOChat can perform a wide variety of spatial and temporal reasoning tasks, substantially outperforming previous vision and language assistants, and even achieving comparable or better performance than specialist models trained to perform these specific tasks. Furthermore, TEOChat achieves impressive zero-shot performance on a change detection and change question answering dataset, outperforms GPT-4o and Gemini 1.5 Pro on multiple temporal tasks, and exhibits stronger single image capabilities than a comparable single EO image instruction-following model.

Many earth observation (EO) tasks require the ability to reason over time. For example, change detection is a widely studied task whose goal is to identify salient changes in a region using multiple EO images capturing the region at different times (Chughtai et al., 2021; Bai et al., 2023; Cheng et al., 2023). Previous methods for automatically detecting change in EO imagery have been specialist models, constraining their use to a single task or small set of tasks that they were explicitly trained to perform (Bai et al., 2023; Cheng et al., 2023). Advancements in the modeling of multimodal data have enabled generalist vision-language models (VLMs) that can perform a variety of natural image interpretation tasks specified flexibly through natural language (Achiam et al., 2023; Team et al., 2023; Liu et al., 2023). However, no prior VLMs can model temporal EO data (left of Figure 1), notably including change detection tasks. We investigate the performance of Video-LLaVA (Lin et al., 2023), a strong natural image pre-trained VLM that can receive images and videos as input, and GeoChat (Kuckreja et al., 2023), a strong VLM fine-tuned on single EO image tasks (right of Figure 1). We find that Video-LLaVA generates inaccurate information, likely because it has primarily been trained on natural images and videos, whereas GeoChat can only take single images as input and cannot process information across time.

Figure 1: TEOChat is the first VLM to model temporal earth observation (EO) data. We compare a temporal VLM (Video-LLaVA (Lin et al., 2023)) and an EO VLM (GeoChat (Kuckreja et al., 2023)) with TEOChat.
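The listing does not show the schema of the instruction-following dataset, but a temporal EO instruction record plausibly pairs an ordered image sequence of a region with a conversational prompt and target response. The sketch below is a hypothetical illustration of such a record and a minimal loader; the field names (`image_paths`, `timestamps`, `conversation`) and the `load_temporal_example` helper are assumptions for illustration, not TEOChat's actual data format.

```python
# Hypothetical sketch of a temporal EO instruction-following record,
# assuming a JSON layout with an ordered image sequence and a dialogue.
# Field names and the helper below are illustrative, not TEOChat's schema.
import json
from dataclasses import dataclass


@dataclass
class TemporalEOExample:
    image_paths: list[str]    # ordered acquisitions of the same region
    timestamps: list[str]     # acquisition dates, aligned with image_paths
    conversation: list[dict]  # alternating "human" / "assistant" turns


def load_temporal_example(path: str) -> TemporalEOExample:
    """Read one instruction-following record from a JSON file."""
    with open(path) as f:
        record = json.load(f)
    return TemporalEOExample(
        image_paths=record["image_paths"],
        timestamps=record["timestamps"],
        conversation=record["conversation"],
    )


# Example record for a building damage assessment task:
example = TemporalEOExample(
    image_paths=["region_042_t0.png", "region_042_t1.png"],
    timestamps=["2023-01-12", "2023-02-20"],
    conversation=[
        {"from": "human",
         "value": "Compare the two images and describe any building damage."},
        {"from": "assistant",
         "value": "Several structures in the northeast quadrant appear destroyed in the second image."},
    ],
)
```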
Oct-8-2024