Improving GUI Grounding with Explicit Position-to-Coordinate Mapping
Suyuchen Wang, Tianyu Zhang, Ahmed Masry, Christopher Pal, Spandana Gella, Bang Liu, Perouz Taslakian
arXiv.org Artificial Intelligence
GUI grounding is the task of mapping natural language instructions to precise pixel coordinates in graphical user interfaces, enabling autonomous agents to interact with software as humans do (Zhang et al., 2025a; Wang et al., 2024a; Zheng et al., 2024). This capability is fundamental for computer automation: without accurate grounding, agents cannot click buttons, fill forms, or navigate interfaces reliably. Early approaches relied on structured metadata from HTML/DOM trees or accessibility APIs (Li et al., 2020; Deng et al., 2023), but these methods face significant limitations: they require access to the underlying UI structure, which is often unavailable in desktop applications, inconsistent across platforms, or entirely absent in legacy systems. Pure vision-based grounding, which operates directly on screenshots, applies to any visual interface without requiring special access or instrumentation (Qin et al., 2025; Wang et al., 2025b; Guo et al., 2025). This approach mirrors human interaction with GUIs and enables automation of any software visible on screen, from modern web applications to legacy desktop tools. Current vision-based approaches typically formulate GUI grounding as a coordinate generation task, where models output pixel positions as text tokens (e.g., "x=523, y=217").
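To make the coordinate-generation formulation concrete, the sketch below shows how such a text-token prediction might be parsed back into a click position and normalized by screenshot resolution. The function names and the exact output pattern are illustrative assumptions, not the paper's implementation:

```python
import re

def parse_coordinate_output(text: str):
    """Extract an (x, y) pixel coordinate from a model's text output,
    assuming the "x=523, y=217" style quoted in the abstract.
    Returns None if no coordinate is found."""
    match = re.search(r"x\s*=\s*(\d+)\s*,\s*y\s*=\s*(\d+)", text)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))

def to_normalized(xy, width: int, height: int):
    """Convert absolute pixel coordinates to [0, 1] normalized form,
    so the prediction can be rescaled to any screenshot resolution."""
    x, y = xy
    return x / width, y / height

# Hypothetical usage: ground a click target predicted as text.
pred = parse_coordinate_output("The submit button is at x=523, y=217.")
if pred is not None:
    nx, ny = to_normalized(pred, width=1920, height=1080)
```

A parser like this is where the formulation's fragility shows up: the coordinate is only as reliable as the model's token-by-token generation of digit strings, which motivates work on more explicit position-to-coordinate mappings.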
Oct-6-2025
- Genre:
- Research Report (0.82)
- Technology:
- Information Technology
- Artificial Intelligence
- Machine Learning (1.00)
- Natural Language (1.00)
- Representation & Reasoning > Agents (0.34)
- Vision (1.00)
- Graphics (1.00)
- Human Computer Interaction > Interfaces (0.86)