Grounding Computer Use Agents on Human Demonstrations

Feizi, Aarash, Nayak, Shravan, Jian, Xiangru, Lin, Kevin Qinghong, Li, Kaixin, Awal, Rabiul, Lù, Xing Han, Obando-Ceron, Johan, Rodriguez, Juan A., Chapados, Nicolas, Vazquez, David, Romero-Soriano, Adriana, Rabbany, Reihaneh, Taslakian, Perouz, Pal, Christopher, Gella, Spandana, Rajeswar, Sai

Nov-11-2025–arXiv.org Artificial Intelligence

Building reliable computer-use agents requires grounding: accurately connecting natural language instructions to the correct on-screen elements. While large datasets exist for web and mobile interactions, high-quality resources for desktop environments are limited. To address this gap, we introduce GroundCUA, a large-scale desktop grounding dataset built from expert human demonstrations. It covers 87 applications across 12 categories and includes 56K screenshots, with every on-screen element carefully annotated for a total of over 3.56M human-verified annotations. From these demonstrations, we generate diverse instructions that capture a wide range of real-world tasks, providing high-quality data for model training. These results demonstrate the critical role of high-quality, expert-driven datasets in advancing general-purpose computer-use agents.

The vision of computer-use agents (CUA) that operate software on behalf of users has gained significant momentum with recent progress in multimodal large language model-based agents (OpenAI, 2025; Anthropic, 2024a; Qin et al., 2025; Wang et al., 2025a). These agents promise to automate routine work and make complex digital tools more accessible.

For such agents to succeed, they must first plan the next step in a task, then ground that plan to the exact on-screen element to click, type into, or drag. Accurate grounding is critical: if the agent cannot identify the right button or menu item, even a flawless plan cannot be executed. In FreeCAD, for instance, when asked to "open the color picker" (Figure 1), the agent must distinguish a small palette icon from look-alike tools and click precisely on the correct one. When grounding fails, the plan quickly veers off course, minor errors compound, and tasks ultimately fail (Nayak et al., 2025).

Moreover, grounding in desktop applications is challenging because of their complexity and diversity. These applications often feature high-resolution displays with dense layouts and visually similar elements, making precise localization difficult.
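To make the grounding step concrete, the sketch below treats each annotated on-screen element as a labeled bounding box and checks whether a predicted click lands inside the intended one. It is a minimal illustration under stated assumptions, not the dataset's schema or the paper's evaluation code: the `UIElement` fields, the helper names, the "fill tool" look-alike, and all pixel coordinates are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class UIElement:
    """Hypothetical annotation: a label plus a pixel bounding box."""
    label: str
    x: int       # left edge
    y: int       # top edge
    width: int
    height: int

    def contains(self, px: int, py: int) -> bool:
        """True if the point (px, py) falls inside this element's box."""
        return (self.x <= px < self.x + self.width
                and self.y <= py < self.y + self.height)


def element_at(elements: List[UIElement], px: int, py: int) -> Optional[UIElement]:
    """Return the smallest-area element containing the point, or None."""
    hits = [e for e in elements if e.contains(px, py)]
    return min(hits, key=lambda e: e.width * e.height, default=None)


def is_grounded(target: UIElement, px: int, py: int) -> bool:
    """A click counts as correctly grounded if it lands inside the target box."""
    return target.contains(px, py)


# Two adjacent 24x24 toolbar icons (made-up coordinates): the intended
# palette icon and a visually similar neighbor.
palette = UIElement("color picker", x=100, y=40, width=24, height=24)
fill_tool = UIElement("fill tool", x=124, y=40, width=24, height=24)
toolbar = [palette, fill_tool]

# A click at the palette's center is grounded; a click 13 pixels to the
# right hits the look-alike instead, so the planned step would fail.
good_click = (112, 52)
bad_click = (125, 52)
```

With small icons packed side by side, a horizontal error of about a dozen pixels is enough to select the wrong tool, which is why dense desktop layouts make precise localization demanding.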
