UI-Vision: A Desktop-centric GUI Benchmark for Visual Perception and Interaction

Nayak, Shravan, Jian, Xiangru, Lin, Kevin Qinghong, Rodriguez, Juan A., Kalsi, Montek, Awal, Rabiul, Chapados, Nicolas, Özsu, M. Tamer, Agrawal, Aishwarya, Vazquez, David, Pal, Christopher, Taslakian, Perouz, Gella, Spandana, Rajeswar, Sai

Mar-19-2025, arXiv.org Artificial Intelligence

Autonomous agents that navigate Graphical User Interfaces (GUIs) to automate tasks like document editing and file management can greatly enhance computer workflows. While existing research focuses on online settings, desktop environments, which are critical for many professional and everyday tasks, remain underexplored due to data collection challenges and licensing issues. We introduce UI-Vision, the first comprehensive, license-permissive benchmark for offline, fine-grained evaluation of computer use agents in real-world desktop environments. Unlike online benchmarks, UI-Vision provides: (i) dense, high-quality annotations of human demonstrations, including bounding boxes, UI labels, and action trajectories (clicks, drags, and keyboard inputs) across 83 software applications, and (ii) three fine-to-coarse-grained tasks (Element Grounding, Layout Grounding, and Action Prediction) with well-defined metrics to rigorously evaluate agents' performance in desktop environments. Our evaluation reveals critical limitations in state-of-the-art models like UI-TARS-72B, including difficulty with understanding professional software, spatial reasoning, and complex actions like drag-and-drop. These findings highlight the challenges in developing fully autonomous computer use agents. By releasing UI-Vision as open source, we aim to advance the development of more capable agents for real-world desktop tasks.
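
To make the offline evaluation setting concrete, the following is a minimal sketch of how an element-grounding score could be computed from bounding-box annotations. The abstract does not specify the metric, so this assumes a point-in-box criterion: a predicted click counts as correct when it lands inside the annotated box of the target element. The names `BoundingBox` and `grounding_accuracy` are hypothetical and are not taken from the UI-Vision release.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BoundingBox:
    """Axis-aligned box in screen pixels (hypothetical annotation format)."""
    left: float
    top: float
    right: float
    bottom: float

    def contains(self, x: float, y: float) -> bool:
        # Points on the box edges are treated as inside.
        return self.left <= x <= self.right and self.top <= y <= self.bottom


def grounding_accuracy(predicted_points, target_boxes):
    """Fraction of predicted click points that fall inside their target box.

    Assumed metric: point-in-box accuracy, not confirmed by the abstract.
    """
    if len(predicted_points) != len(target_boxes):
        raise ValueError("need exactly one predicted point per target box")
    if not target_boxes:
        return 0.0
    hits = sum(
        box.contains(x, y)
        for (x, y), box in zip(predicted_points, target_boxes)
    )
    return hits / len(target_boxes)
```

Because the annotations are fixed ahead of time, a score like this can be computed without running the application, which is what distinguishes an offline benchmark from an online one.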

large language model, machine learning, natural language


Country:
- South America (0.04)
- North America
  - Central America (0.04)
  - Canada > Quebec
    - Montreal (0.04)
- Asia
  - Singapore (0.04)
  - India (0.04)
  - Japan > Honshū
    - Chūbu > Toyama Prefecture > Toyama (0.04)

Genre:
- Research Report > New Finding (0.46)

Industry:
- Information Technology (0.93)

Technology:
- Information Technology
  - Software (1.00)
  - Graphics (1.00)
  - Communications (1.00)
  - Human Computer Interaction > Interfaces (0.89)
  - Artificial Intelligence
    - Representation & Reasoning (1.00)
    - Natural Language
      - Large Language Model (1.00)
      - Chatbot (1.00)
    - Machine Learning > Neural Networks
      - Deep Learning (1.00)
