Geo-R1: Unlocking VLM Geospatial Reasoning with Cross-View Reinforcement Learning

Xu, Chenhui, Yu, Fuxun, Bianco, Michael J., Kovarskiy, Jacob, Tang, Raphael, Zhang, Qi, Xu, Zirui, LeVine, Will, Dubbs, Brandon, Liao, Heming, Burgess, Cassandra, Bag, Suvam, Patravali, Jay, Kukal, Rupanjali, Figueroa, Mikael, Madhok, Rishi, Karianakis, Nikolaos, Xiong, Jinjun

Oct-2-2025–arXiv.org Artificial Intelligence

We introduce Geo-R1, a reasoning-centric post-training framework that unlocks geospatial reasoning in vision-language models by combining two stages: thinking scaffolding and elevating. In the scaffolding stage, Geo-R1 instills a "geospatial thinking paradigm" via supervised fine-tuning on synthetic chain-of-thought exemplars, enabling models to connect visual cues with geographic priors without costly human reasoning annotations. In the elevating stage, it uses GRPO-based reinforcement learning on a weakly supervised cross-view pairing proxy. This design supplies a verifiable and scalable reward signal, teaching models to capture and reconcile features across modalities and to harness reasoning for accurate prediction. Geo-R1 extends geospatial modeling from domain pretraining and supervised fine-tuning to reasoning-first post-training, and achieves state-of-the-art performance across various geospatial reasoning benchmarks. Our model is available at https://huggingface.co/miniHui/Geo-R1.

Figure 1: Geo-R1 significantly outperforms the baseline (Bai et al., 2025) across 13 verifiable geo-reasoning tasks on the GeoChain benchmark (Yerramilli et al., 2025) in the zero-shot setting. See Table 6 for a detailed description of these tasks.

Geospatial reasoning is fundamental to a wide range of scientific and societal applications, spanning disaster response, search and rescue, urban planning, environmental monitoring, and sociocultural study. Unlike common vision-language reasoning (Li et al., 2024), which centers on object recognition, captioning, and general question answering, geospatial reasoning spans many modalities (e.g., aerial imagery, street-view photos, location metadata, and place information) and varied tasks (e.g., geographical, environmental, and sociocultural), as shown in Figure 1. This blend of multimodal evidence and knowledge-intensive tasks makes general reasoning both crucial for geospatial understanding and uniquely challenging.

While effective in natural domains, SFT is poorly suited to geospatial settings. Raw geospatial data can be plentiful, but supervision is sparse, usually limited to coordinate metadata without descriptive content.
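The abstract's elevating stage pairs a verifiable reward with GRPO, which scores each sampled response relative to the other responses drawn for the same prompt. The sketch below illustrates that idea only; it is not the authors' implementation. The task framing (pick which numbered candidate view matches a query image), the `<answer>` tag format, and the names `pairing_reward` and `group_relative_advantages` are all assumptions made for illustration.

```python
import re
from statistics import mean, pstdev


def pairing_reward(response: str, correct_index: int) -> float:
    """Hypothetical verifiable reward: 1.0 if the first <answer>k</answer>
    tag in the response names the correct candidate view, else 0.0."""
    match = re.search(r"<answer>\s*(\d+)\s*</answer>", response)
    if match and int(match.group(1)) == correct_index:
        return 1.0
    return 0.0


def group_relative_advantages(rewards, eps=1e-6):
    """GRPO-style advantages: standardize each reward against the mean and
    standard deviation of its own sampling group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four sampled responses to one cross-view pairing prompt (made-up strings).
responses = [
    "<think>road layout matches</think><answer>2</answer>",
    "<answer>1</answer>",
    "no parsable answer",
    "<answer> 2 </answer>",
]
rewards = [pairing_reward(r, correct_index=2) for r in responses]
advantages = group_relative_advantages(rewards)
```

Because the reward is checked against the known pairing rather than a human-written rationale, it needs no reasoning annotations, which is the sense in which the abstract calls the signal verifiable and scalable. When every response in a group earns the same reward, the advantages are all zero and that group contributes no learning signal.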

Country:
- North America > United States (0.28)
- Asia (0.28)

Genre:
- Research Report > New Finding (0.46)

Industry:
- Education (0.34)

Technology:
- Information Technology > Artificial Intelligence
  - Representation & Reasoning > Spatial Reasoning (1.00)
  - Natural Language
    - Large Language Model (1.00)
    - Chatbot (0.94)
  - Machine Learning > Neural Networks
    - Deep Learning (0.94)