\textsc{GUI-Spotlight}: Adaptive Iterative Focus Refinement for Enhanced GUI Visual Grounding
Bin Lei, Nuo Xu, Ali Payani, Mingyi Hong, Chunhua Liao, Yu Cao, Caiwen Ding
arXiv.org Artificial Intelligence
Multimodal large language models (MLLMs) have markedly expanded the competence of graphical user-interface (GUI) systems, propelling them beyond controlled simulations into complex, real-world environments across diverse platforms. However, their practical usefulness is still bounded by the reliability of visual grounding, i.e., mapping textual references to exact on-screen elements. This limitation prevents such systems from accurately performing pointer-level actions such as clicking or dragging. On the ScreenSpot-Pro benchmark, GUI-Spotlight, trained on only 18.5K samples, achieves 52.8% accuracy, surpassing V2P-7B (50.6% with 9.6M training samples) and GTA-1-7B (50.1% with 1.56M training samples).

Rapid advances in multimodal large language models have driven swift progress in GUI agents capable of handling complex tasks on general graphical user interfaces (Xie et al., 2024; Wu et al., 2024a). Nevertheless, current GUI agents still lack robust, fine-grained visual grounding, which makes it difficult to translate what to do into where to act on complex, dynamically changing screens (Jang et al., 2024; Xie et al., 2025).
Oct-7-2025
- Genre:
- Research Report (1.00)
- Technology:
- Information Technology
- Graphics (1.00)
- Artificial Intelligence
- Vision (1.00)
- Natural Language > Large Language Model (0.86)