Chain-of-Ground: Improving GUI Grounding via Iterative Reasoning and Reference Feedback
Aiden Yiliu Li, Bizhi Yu, Daoan Lei, Tianhe Ren, Shilong Liu
arXiv.org Artificial Intelligence
GUI grounding aims to align natural-language instructions with precise regions in complex user interfaces (UIs). While advanced MLLMs have demonstrated strong visual GUI grounding capabilities, they still struggle with small or visually similar targets and with ambiguity in real-world layouts. We argue that these limitations stem not only from the models' inherent grounding capacity but also from an overlooked underutilization of their existing reasoning potential. To address this, we present Chain-of-Ground (CoG), a training-free multi-step grounding framework that leverages MLLMs for iterative visual reasoning and refinement. Instead of relying on a single direct prediction, Chain-of-Ground lets the model progressively reflect on and adjust its hypotheses, achieving more accurate and interpretable localization. Our approach establishes a new state of the art on the ScreenSpot-Pro benchmark with 68.4% accuracy, surpassing the previous best by 4.8%. To evaluate real-world generalization, we introduce TPanel-UI, a dataset of 420 labeled industrial control panels featuring visual distortions such as blur and masking to test robustness. On TPanel-UI, Chain-of-Ground outperforms the state-of-the-art MLLM Qwen3-VL-235B by 6.9%, demonstrating the effectiveness of multi-step, training-free grounding across real-world and digital interfaces. Together, these results point to a new direction for unlocking MLLMs' grounding potential through structured, iterative refinement rather than additional training.
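The abstract describes a propose-critique-refine loop but gives no algorithmic detail. The following is a minimal sketch of what such a training-free iterative grounding loop could look like; the names (`ground_iteratively`, `propose`, `critique`, `refine`, `StubModel`) and the control flow are assumptions for illustration, not the paper's actual interface.

```python
from dataclasses import dataclass


@dataclass
class Box:
    """Candidate target region in screen coordinates."""
    x: int
    y: int
    w: int
    h: int


def ground_iteratively(model, screenshot, instruction, max_steps=3):
    """Hypothetical training-free multi-step grounding loop: the model
    proposes a box, then repeatedly critiques and refines its own
    hypothesis until it accepts the box or the step budget is exhausted."""
    box = model.propose(screenshot, instruction)  # initial hypothesis
    for _ in range(max_steps):
        feedback = model.critique(screenshot, instruction, box)
        if feedback is None:  # model accepts its current hypothesis
            break
        box = model.refine(screenshot, instruction, box, feedback)
    return box


class StubModel:
    """Toy stand-in for an MLLM, used only to exercise the loop:
    each refinement nudges the box toward a known target location."""

    def __init__(self, target):
        self.target = target

    def propose(self, screenshot, instruction):
        return Box(0, 0, 10, 10)

    def critique(self, screenshot, instruction, box):
        on_target = (box.x, box.y) == (self.target.x, self.target.y)
        return None if on_target else "move toward target"

    def refine(self, screenshot, instruction, box, feedback):
        # Move at most 20 px per step along each axis.
        step = lambda a, b: a + max(-20, min(20, b - a))
        return Box(step(box.x, self.target.x), step(box.y, self.target.y),
                   box.w, box.h)


result = ground_iteratively(StubModel(Box(40, 20, 10, 10)), None,
                            "click the start button")
```

With the stub above, the loop converges in two refinement steps: (0, 0) to (20, 20) to (40, 20), after which the critique accepts the box. The key design point this illustrates is that refinement is driven by the model's own reference feedback rather than by any additional training.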
Dec-2-2025