Improving GUI Grounding with Explicit Position-to-Coordinate Mapping
Suyuchen Wang, Tianyu Zhang, Ahmed Masry, Christopher Pal, Spandana Gella, Bang Liu, Perouz Taslakian
arXiv.org Artificial Intelligence
GUI grounding is the task of mapping natural language instructions to precise pixel coordinates in graphical user interfaces, enabling autonomous agents to interact with software as humans do (Zhang et al., 2025a; Wang et al., 2024a; Zheng et al., 2024). This capability is fundamental for computer automation: without accurate grounding, agents cannot click buttons, fill forms, or navigate interfaces reliably. Early approaches relied on structured metadata from HTML/DOM trees or accessibility APIs (Li et al., 2020; Deng et al., 2023), but these methods face significant limitations: they require access to the underlying UI structure, which is often unavailable in desktop applications, inconsistent across platforms, or entirely absent in legacy systems. Pure vision-based grounding, which operates directly on screenshots, applies to any visual interface without requiring special access or instrumentation (Qin et al., 2025; Wang et al., 2025b; Guo et al., 2025). This approach mirrors human interaction with GUIs and enables automation of any software visible on screen, from modern web applications to legacy desktop tools. Current vision-based approaches typically formulate GUI grounding as a coordinate generation task, where models output pixel positions as text tokens (e.g., "x=523, y=217").
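To make the coordinate-generation formulation concrete, the sketch below shows how such a text-token prediction might be parsed back into a click position and normalized by screenshot resolution. The function names and the exact output pattern are illustrative assumptions, not the paper's implementation:

```python
import re

def parse_coordinate_output(text: str):
    """Extract an (x, y) pixel coordinate from a model's text output,
    assuming the "x=523, y=217" style quoted in the abstract.
    Returns None if no coordinate is found."""
    match = re.search(r"x\s*=\s*(\d+)\s*,\s*y\s*=\s*(\d+)", text)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))

def to_normalized(xy, width: int, height: int):
    """Convert absolute pixel coordinates to [0, 1] normalized form,
    so the prediction can be rescaled to any screenshot resolution."""
    x, y = xy
    return x / width, y / height

# Hypothetical usage: ground a click target predicted as text.
pred = parse_coordinate_output("The submit button is at x=523, y=217.")
if pred is not None:
    nx, ny = to_normalized(pred, width=1920, height=1080)
```

A parser like this is where the formulation's fragility shows up: the coordinate is only as reliable as the model's token-by-token generation of digit strings, which motivates work on more explicit position-to-coordinate mappings.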
Oct-6-2025
- Genre:
- Research Report (0.82)
- Technology:
- Information Technology
- Artificial Intelligence
- Machine Learning (1.00)
- Natural Language (1.00)
- Representation & Reasoning > Agents (0.34)
- Vision (1.00)
- Graphics (1.00)
- Human Computer Interaction > Interfaces (0.86)