Img2Loc: Revisiting Image Geolocalization using Multi-modality Foundation Models and Image-based Retrieval-Augmented Generation

Zhou, Zhongliang, Zhang, Jielu, Guan, Zihan, Hu, Mengxuan, Lao, Ni, Mu, Lan, Li, Sheng, Mai, Gengchen

arXiv.org Artificial Intelligence 

The field of visual recognition has witnessed marked improvement, with state-of-the-art models significantly advancing areas such as object classification [1, 2, 3, 4], object detection [5, 6, 7], semantic segmentation [8, 9, 10, 11], scene parsing [12, 13], disaster response [14, 15], and environmental monitoring [16], among others [17, 18, 19]. As progress continues, the information retrieval community is widening its focus to include the prediction of more detailed and intricate attributes of information. A key attribute in this expanded scope is image geolocalization [20, 21, 22], which aims to determine the exact geographic coordinates of a given image. The ability to accurately geolocalize images is crucial, as it makes it possible to deduce a wide array of related attributes, such as temperature, elevation, crime rate, population density, and income level, providing comprehensive insight into the context surrounding the image.

In our study, we focus on predicting the geographic coordinates of a photograph solely from the ground-view image. Predictions are considered accurate if they closely match the actual location (Figure 1). Prevailing research approaches fall into two categories: retrieval-based and classification-based methods. Retrieval-based techniques compare query images against a geo-tagged image database [23, 24, 25, 26, 27, 28, 29, 30], using the location of the database image that most closely matches the query image to infer its location.
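The retrieval-based approach described above can be sketched as a nearest-neighbor search over image embeddings. The following is a minimal illustration, not the paper's actual pipeline: the embeddings, database contents, and function name are all hypothetical, and a real system would use a learned visual encoder and an approximate-nearest-neighbor index rather than brute-force cosine similarity.

```python
import numpy as np

def geolocate_by_retrieval(query_emb, db_embs, db_coords):
    """Return the (lat, lon) of the geo-tagged database image whose
    embedding has the highest cosine similarity to the query embedding."""
    q = query_emb / np.linalg.norm(query_emb)
    d = db_embs / np.linalg.norm(db_embs, axis=1, keepdims=True)
    sims = d @ q                      # cosine similarity to every database image
    return db_coords[int(np.argmax(sims))]

# Toy database: three geo-tagged "images" with 4-d embeddings (illustrative only)
db_embs = np.array([[1.0, 0.0, 0.0, 0.0],
                    [0.0, 1.0, 0.0, 0.0],
                    [0.0, 0.0, 1.0, 0.0]])
db_coords = [(48.8584, 2.2945),     # Paris
             (40.6892, -74.0445),   # New York
             (35.6586, 139.7454)]   # Tokyo
query = np.array([0.1, 0.9, 0.05, 0.0])
print(geolocate_by_retrieval(query, db_embs, db_coords))  # → (40.6892, -74.0445)
```

The predicted location is simply that of the best-matching database image; more elaborate variants aggregate the coordinates of the top-k matches.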
