Chain-of-Ground: Improving GUI Grounding via Iterative Reasoning and Reference Feedback
Aiden Yiliu Li, Bizhi Yu, Daoan Lei, Tianhe Ren, Shilong Liu
arXiv.org Artificial Intelligence
GUI grounding aims to align natural-language instructions with precise regions in complex user interfaces (UIs). While advanced MLLMs have demonstrated strong visual GUI grounding capabilities, they still struggle with small or visually similar targets and with ambiguity in real-world layouts. We argue that these limitations stem not only from the models' inherent grounding capacity but also from an overlooked underutilization of their existing reasoning potential. To address this, we present Chain-of-Ground (CoG), a training-free multi-step grounding framework that leverages MLLMs for iterative visual reasoning and refinement. Instead of relying on a single direct prediction, Chain-of-Ground lets the model progressively reflect on and adjust its hypotheses, achieving more accurate and interpretable localization. Our approach establishes a new state of the art on the ScreenSpot-Pro benchmark with 68.4% accuracy, surpassing the previous best by 4.8%. To evaluate real-world generalization, we introduce TPanel-UI, a dataset of 420 labeled industrial control panels featuring visual distortions such as blur and masking to test robustness. On TPanel-UI, Chain-of-Ground outperforms the state-of-the-art MLLM Qwen3-VL-235B by 6.9%, demonstrating the effectiveness of multi-step, training-free grounding across real-world and digital interfaces. Together, these results point to a new direction for unlocking MLLMs' grounding potential through structured, iterative refinement rather than additional training.
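The abstract describes a propose-critique-refine loop but gives no algorithmic detail. The following is a minimal sketch of what such a training-free iterative grounding loop could look like; the names (`ground_iteratively`, `propose`, `critique`, `refine`, `StubModel`) and the control flow are assumptions for illustration, not the paper's actual interface.

```python
from dataclasses import dataclass


@dataclass
class Box:
    """Candidate target region in screen coordinates."""
    x: int
    y: int
    w: int
    h: int


def ground_iteratively(model, screenshot, instruction, max_steps=3):
    """Hypothetical training-free multi-step grounding loop: the model
    proposes a box, then repeatedly critiques and refines its own
    hypothesis until it accepts the box or the step budget is exhausted."""
    box = model.propose(screenshot, instruction)  # initial hypothesis
    for _ in range(max_steps):
        feedback = model.critique(screenshot, instruction, box)
        if feedback is None:  # model accepts its current hypothesis
            break
        box = model.refine(screenshot, instruction, box, feedback)
    return box


class StubModel:
    """Toy stand-in for an MLLM, used only to exercise the loop:
    each refinement nudges the box toward a known target location."""

    def __init__(self, target):
        self.target = target

    def propose(self, screenshot, instruction):
        return Box(0, 0, 10, 10)

    def critique(self, screenshot, instruction, box):
        on_target = (box.x, box.y) == (self.target.x, self.target.y)
        return None if on_target else "move toward target"

    def refine(self, screenshot, instruction, box, feedback):
        # Move at most 20 px per step along each axis.
        step = lambda a, b: a + max(-20, min(20, b - a))
        return Box(step(box.x, self.target.x), step(box.y, self.target.y),
                   box.w, box.h)


result = ground_iteratively(StubModel(Box(40, 20, 10, 10)), None,
                            "click the start button")
```

With the stub above, the loop converges in two refinement steps: (0, 0) to (20, 20) to (40, 20), after which the critique accepts the box. The key design point this illustrates is that refinement is driven by the model's own reference feedback rather than by any additional training.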
Dec-2-2025