\textsc{GUI-Spotlight}: Adaptive Iterative Focus Refinement for Enhanced GUI Visual Grounding

Lei, Bin; Xu, Nuo; Payani, Ali; Hong, Mingyi; Liao, Chunhua; Cao, Yu; Ding, Caiwen

arXiv.org Artificial Intelligence 

Multimodal large language models (MLLMs) have markedly expanded the competence of graphical user interface (GUI) systems, propelling them beyond controlled simulations into complex, real-world environments across diverse platforms. However, their practical usefulness is still bounded by the reliability of visual grounding, i.e., mapping textual references to exact on-screen elements. This limitation prevents such systems from accurately performing pointer-level actions such as clicking or dragging. On the ScreenSpot-Pro benchmark, GUI-Spotlight, trained on only 18.5K samples, achieves 52.8% accuracy, surpassing V2P-7B (50.6% with 9.6M training samples) and GTA-1-7B (50.1% with 1.56M training samples).

Rapid advances in MLLMs have driven progress in GUI agents capable of handling complex tasks on general GUIs (Xie et al., 2024; Wu et al., 2024a). Nevertheless, current GUI agents still lack robust, fine-grained visual grounding, making it difficult to translate what to do into where to act on complex, dynamically changing screens (Jang et al., 2024; Xie et al., 2025).