Img2Loc: Revisiting Image Geolocalization using Multi-modality Foundation Models and Image-based Retrieval-Augmented Generation
Zhongliang Zhou, Jielu Zhang, Zihan Guan, Mengxuan Hu, Ni Lao, Lan Mu, Sheng Li, Gengchen Mai
arXiv.org Artificial Intelligence
The field of visual recognition has witnessed marked improvement, with state-of-the-art models significantly advancing areas such as object classification [1, 2, 3, 4], object detection [5, 6, 7], semantic segmentation [8, 9, 10, 11], scene parsing [12, 13], disaster response [14, 15], and environmental monitoring [16], among others [17, 18, 19]. As progress continues, the information retrieval community is widening its focus to the prediction of more detailed and intricate attributes of information. A key attribute in this expanded scope is image geolocalization [20, 21, 22], which aims to determine the exact geographic coordinates of a given image. The ability to accurately geolocalize images is crucial, as it enables the deduction of a wide array of related attributes, such as temperature, elevation, crime rate, population density, and income level, providing comprehensive insight into the context surrounding the image. In our study, we predict the geographic coordinates of a photograph solely from the ground-view image. Predictions are considered accurate if they closely match the actual location (Figure 1). Prevailing research approaches fall into two categories: retrieval-based and classification-based methods. Retrieval-based techniques compare a query image against a geo-tagged image database [23, 24, 25, 26, 27, 28, 29, 30], using the location of the database image that most closely matches the query to infer its location.
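The retrieval-based approach described above can be sketched as a nearest-neighbor lookup over image embeddings. The sketch below is illustrative only: the embeddings, database entries, and coordinates are made up, and a real system would use features from a pretrained image encoder and an approximate nearest-neighbor index rather than a brute-force dot product.

```python
import numpy as np

def normalize(v):
    """L2-normalize feature vectors along the last axis."""
    return v / np.linalg.norm(v, axis=-1, keepdims=True)

def geolocalize(query_feat, db_feats, db_coords):
    """Return the (lat, lon) of the database image most similar to the query.

    Similarity is cosine similarity, computed as a dot product over
    L2-normalized embeddings; the best match's coordinates are returned.
    """
    sims = normalize(db_feats) @ normalize(query_feat)
    return db_coords[int(np.argmax(sims))]

# Toy geo-tagged database: 3 images with made-up 4-d embeddings and coordinates.
db_feats = np.array([
    [1.0, 0.0, 0.0, 0.0],   # hypothetical image near Paris
    [0.0, 1.0, 0.0, 0.0],   # hypothetical image near Tokyo
    [0.0, 0.0, 1.0, 0.0],   # hypothetical image near New York
])
db_coords = [(48.8566, 2.3522), (35.6762, 139.6503), (40.7128, -74.0060)]

query = np.array([0.1, 0.9, 0.05, 0.0])  # closest to the Tokyo embedding
print(geolocalize(query, db_feats, db_coords))  # -> (35.6762, 139.6503)
```

Because the predicted location is simply copied from the best-matching database image, accuracy of this baseline hinges on the coverage and density of the geo-tagged database.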
Mar-28-2024