LRSCLIP: A Vision-Language Foundation Model for Aligning Remote Sensing Image with Longer Text

Weizhi Chen, Jingbo Chen, Yupeng Deng, Jiansheng Chen, Yuman Feng, Zhihao Xi, Diyou Liu, Kai Li, Yu Meng

arXiv.org Artificial Intelligence 

This study addresses two bottlenecks in remote sensing vision-language foundation models (VLFMs): the technical difficulty of handling long text and the "hallucination" problem caused by the limited information in short text. We propose a novel vision-language foundation model, LRSCLIP, and a multimodal dataset, LRS2M. The main contributions are as follows: (1) by integrating multi-source remote sensing data and adopting a large-language-model labeling strategy, we construct the LRS2M dataset of 2 million image-text pairs, which for the first time provides both short and long texts and thereby overcomes the semantic-granularity limitations of existing datasets; (2) we design the LRSCLIP architecture around Long-CLIP's knowledge-preserved stretching (KPS) module, which extends CLIP's text-processing capacity, and achieve fine-grained cross-modal feature alignment through a dual-text loss-weighting mechanism. Experimental results show that LRSCLIP improves retrieval accuracy by 10%-20% over the Long-CLIP baseline on the zero-shot long-text cross-modal retrieval task. On the zero-shot short-text cross-modal retrieval task, LRSCLIP surpasses the current best model, GeoRSCLIP, by 0.17%, 0.67%, and 0.92% in Text-to-Image R@1, Image-to-Text R@1, and mR on RSITMD, respectively, and by 0.04%, 2.93%, and 1.28% on RSICD. This work provides a new benchmark model and data support for remote sensing multimodal learning.

This research was funded by the National Key R&D Program of China under grant number 2021YFB3900504. Weizhi Chen and Kai Li are with the Aerospace Information Research Institute, Chinese Academy of Sciences, Beijing 100101, China, and also with the School of Electronic, Electrical and Communication Engineering, University of Chinese Academy of Sciences, Beijing 100049, China. Jingbo Chen, Yupeng Deng, Jiansheng Chen, Zhihao Xi, Diyou Liu, and Yu Meng are with the Aerospace Information Research Institute, Chinese Academy of Sciences, Beijing 100101, China. Yuman Feng is with the School of Information Network Security, People's Public Security University of China, Beijing 100038, China.

Recent years have seen significant progress in foundation models (FMs) in the fields of computer vision (CV) and natural language processing (NLP) [1]-[8]. Unlike models designed for specific task objectives, a VLFM learns joint representations from massive image-text pairs in upstream tasks and then transfers this knowledge to various downstream tasks, demonstrating exceptional performance. Several outstanding VLFMs have already emerged, such as CLIP [10], BLIP [11], [12], and MaskVLM [13]. Meanwhile, researchers have begun exploring the application potential of VLFMs in the remote sensing domain. However, VLFMs often suffer from the long-tail effect (a small number of classes dominate while the rest have few samples), which makes their direct application to remote sensing tasks challenging [14].
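For context on the KPS module mentioned in the abstract: in Long-CLIP, knowledge-preserved stretching extends the pretrained 77-token positional embedding of CLIP's text encoder by keeping the first, well-trained positions unchanged and linearly interpolating the remainder. The sketch below is a minimal PyTorch illustration of that idea; the split point (20 positions) and stretch ratio (4x, giving 248 positions) follow the Long-CLIP paper, while the function name and shapes are our own illustrative assumptions rather than LRSCLIP's actual code.

```python
import torch
import torch.nn.functional as F

def stretch_positional_embedding(pos_emb: torch.Tensor,
                                 keep: int = 20,
                                 ratio: int = 4) -> torch.Tensor:
    """Knowledge-preserved stretching (Long-CLIP style), illustrative only.

    pos_emb: [77, dim] positional embedding from a pretrained CLIP text
    encoder. The first `keep` positions (densely trained on short captions)
    are copied unchanged; the remaining positions are linearly interpolated
    by `ratio` to make room for longer token sequences.
    """
    head = pos_emb[:keep]                          # [keep, dim], kept as-is
    tail = pos_emb[keep:].T.unsqueeze(0)           # [1, dim, 77 - keep]
    tail = F.interpolate(tail,
                         size=(pos_emb.shape[0] - keep) * ratio,
                         mode="linear", align_corners=True)
    tail = tail.squeeze(0).T                       # [(77 - keep) * ratio, dim]
    return torch.cat([head, tail], dim=0)          # [20 + 228 = 248, dim]

# Example: stretch a CLIP-sized embedding from 77 to 248 positions.
pos = torch.randn(77, 512)
print(stretch_positional_embedding(pos).shape)     # torch.Size([248, 512])
```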
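The dual-text loss-weighting mechanism pairs each image with both a long and a short caption and combines the two alignment losses. Below is a minimal sketch assuming a standard CLIP-style symmetric InfoNCE objective; the weighting coefficient `alpha`, the temperature, and all function names are illustrative placeholders, not the values or API used in the paper.

```python
import torch
import torch.nn.functional as F

def clip_contrastive_loss(img: torch.Tensor, txt: torch.Tensor,
                          temperature: float = 0.07) -> torch.Tensor:
    """Symmetric InfoNCE loss over a batch of image/text embeddings."""
    img = F.normalize(img, dim=-1)
    txt = F.normalize(txt, dim=-1)
    logits = img @ txt.T / temperature             # [B, B] similarity matrix
    labels = torch.arange(img.shape[0], device=img.device)
    return 0.5 * (F.cross_entropy(logits, labels) +
                  F.cross_entropy(logits.T, labels))

def dual_text_loss(img_emb, long_txt_emb, short_txt_emb, alpha: float = 0.8):
    """Weighted sum of long-text and short-text alignment losses.

    `alpha` balances fine-grained (long-text) against coarse (short-text)
    supervision; the value here is a placeholder, not the paper's setting.
    """
    return (alpha * clip_contrastive_loss(img_emb, long_txt_emb) +
            (1.0 - alpha) * clip_contrastive_loss(img_emb, short_txt_emb))

# Toy usage with random embeddings for a batch of 8.
B, D = 8, 512
loss = dual_text_loss(torch.randn(B, D), torch.randn(B, D), torch.randn(B, D))
print(loss.item())
```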
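For readers unfamiliar with the retrieval metrics quoted in the abstract: R@K is the fraction of queries whose ground-truth match appears among the top-K retrieved items, and mR (mean recall) is conventionally the average of R@1, R@5, and R@10 over both retrieval directions. A small sketch, assuming exactly one ground-truth match per query:

```python
import torch

def recall_at_k(sim: torch.Tensor, k: int) -> float:
    """sim: [Q, G] query-gallery similarities; ground truth is the diagonal
    (query i matches gallery item i). Returns the fraction of queries whose
    match ranks in the top-k."""
    topk = sim.topk(k, dim=1).indices                        # [Q, k]
    gt = torch.arange(sim.shape[0], device=sim.device).unsqueeze(1)
    return (topk == gt).any(dim=1).float().mean().item()

def mean_recall(sim_i2t: torch.Tensor) -> float:
    """mR: average of R@1/5/10 over image-to-text and text-to-image."""
    sim_t2i = sim_i2t.T
    scores = [recall_at_k(s, k) for s in (sim_i2t, sim_t2i) for k in (1, 5, 10)]
    return sum(scores) / len(scores)

# Toy usage: 100 images vs. 100 captions with random similarities.
sim = torch.randn(100, 100)
print(mean_recall(sim))
```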