Grounding Computer Use Agents on Human Demonstrations
Feizi, Aarash, Nayak, Shravan, Jian, Xiangru, Lin, Kevin Qinghong, Li, Kaixin, Awal, Rabiul, Lù, Xing Han, Obando-Ceron, Johan, Rodriguez, Juan A., Chapados, Nicolas, Vazquez, David, Romero-Soriano, Adriana, Rabbany, Reihaneh, Taslakian, Perouz, Pal, Christopher, Gella, Spandana, Rajeswar, Sai
–arXiv.org Artificial Intelligence
Building reliable computer-use agents requires grounding: accurately connecting natural language instructions to the correct on-screen elements. While large datasets exist for web and mobile interactions, high-quality resources for desktop environments are limited. We introduce GroundCUA, a large-scale desktop grounding dataset built from expert human demonstrations. It covers 87 applications across 12 categories and includes 56K screenshots, with every on-screen element carefully annotated for a total of over 3.56M human-verified annotations. From these demonstrations, we generate diverse instructions that capture a wide range of real-world tasks, providing high-quality data for model training. These results demonstrate the critical role of high-quality, expert-driven datasets in advancing general-purpose computer-use agents.

The vision of computer-use agents (CUA) that operate software on behalf of users has gained significant momentum with recent progress in multimodal large language model-based agents (OpenAI, 2025; Anthropic, 2024a; Qin et al., 2025; Wang et al., 2025a). These agents promise to automate routine work and make complex digital tools more accessible. To succeed, such agents must first plan the next step in a task, then ground that plan to the exact on-screen element to click, type into, or drag. Accurate grounding is critical: if the agent cannot identify the right button or menu item, even a flawless plan cannot be executed. In FreeCAD, for instance, when asked to "open the color picker" (Figure 1), the agent must distinguish a small palette icon from look-alike tools and click it precisely. When grounding fails, the plan quickly veers off course, minor errors compound, and tasks ultimately fail (Nayak et al., 2025).

Moreover, grounding in desktop applications is challenging due to their complexity and diversity. These applications often feature high-resolution displays with dense layouts and visually similar elements, making precise localization difficult.
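To make the grounding step concrete, here is a minimal sketch of how a predicted click could be checked against an annotated target element. It assumes each element is annotated with an axis-aligned bounding box in pixel coordinates. The names `Element`, `click_hits_target`, and `element_at` are hypothetical, the coordinates are invented for illustration, and this is not necessarily the evaluation protocol used by the authors.

```python
from dataclasses import dataclass
from typing import Optional, Sequence, Tuple


@dataclass(frozen=True)
class Element:
    """Hypothetical annotation: a label plus an axis-aligned box in pixels."""
    label: str
    x: float       # left edge
    y: float       # top edge
    width: float
    height: float

    def contains(self, px: float, py: float) -> bool:
        # A point is inside when it lies within both extents (edges inclusive).
        return (self.x <= px <= self.x + self.width
                and self.y <= py <= self.y + self.height)


def click_hits_target(click: Tuple[float, float], target: Element) -> bool:
    """Return True when the predicted click lands inside the target box."""
    return target.contains(*click)


def element_at(click: Tuple[float, float],
               elements: Sequence[Element]) -> Optional[Element]:
    """Return the first annotated element containing the click, if any."""
    for element in elements:
        if element.contains(*click):
            return element
    return None


# Invented example: a small palette icon next to a look-alike tool.
palette = Element("color picker", x=100, y=20, width=16, height=16)
lookalike = Element("paint bucket", x=120, y=20, width=16, height=16)
toolbar = [palette, lookalike]

good_click = (108, 28)   # lands inside the palette icon
bad_click = (125, 28)    # a few pixels right: lands on the look-alike

print(click_hits_target(good_click, palette))
print(click_hits_target(bad_click, palette))
print(element_at(bad_click, toolbar).label)
```

In this invented layout a horizontal offset of under 20 pixels moves the click from the intended icon to its neighbour, which illustrates why dense layouts with small, similar-looking elements make precise localization hard.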
Nov-11-2025