"Are We Done Yet?": A Vision-Based Judge for Autonomous Task Completion of Computer Use Agents

Nov-26-2025–arXiv.org Artificial Intelligence

Computer Use Agents (CUAs) are designed to autonomously operate digital interfaces, yet they often fail to reliably determine whether a given task has been successfully completed. We present an autonomous evaluation and feedback framework that leverages Vision-Language Models (VLMs) to assess task completion directly from screenshots and task descriptions. Our dataset comprises 1,260 human-labeled tasks across 42 built-in macOS applications, spanning a wide range of scenarios. Our framework achieves up to 73% classification accuracy in task success detection and yields an average relative improvement of 27% in the overall task success rate of CUAs when evaluator feedback is applied. These results demonstrate that vision-based evaluation can serve as an actionable feedback mechanism that significantly improves the reliability and self-correction of autonomous computer-use agents.
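To make the evaluate-then-retry idea concrete, here is a minimal Python sketch of a judge-in-the-loop wrapper, together with the two metrics the abstract reports (classification accuracy and relative improvement). It is an illustration under stated assumptions, not the authors' implementation: the names `Verdict`, `Judge`, `Agent`, `run_with_feedback`, `stub_agent`, and `stub_judge` are hypothetical, the retry limit is arbitrary, and the stubs stand in for a real CUA and a real VLM call.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Verdict:
    """Judge output: whether the task looks complete, plus free-text feedback."""
    success: bool
    feedback: str


# Hypothetical interfaces (assumptions, not taken from the paper):
# a judge maps (task description, screenshot bytes) to a Verdict;
# an agent maps (task description, optional feedback) to a new screenshot.
Judge = Callable[[str, bytes], Verdict]
Agent = Callable[[str, Optional[str]], bytes]


def run_with_feedback(task: str, agent: Agent, judge: Judge,
                      max_attempts: int = 3) -> bool:
    """Run the agent, ask the judge, and retry with the judge's feedback."""
    feedback: Optional[str] = None
    for _ in range(max_attempts):
        screenshot = agent(task, feedback)
        verdict = judge(task, screenshot)
        if verdict.success:
            return True
        feedback = verdict.feedback  # passed to the agent on the next attempt
    return False


def accuracy(predicted: List[bool], labels: List[bool]) -> float:
    """Fraction of judge predictions that match the human labels."""
    if len(predicted) != len(labels) or not labels:
        raise ValueError("need two non-empty lists of equal length")
    return sum(p == y for p, y in zip(predicted, labels)) / len(labels)


def relative_improvement(baseline: float, improved: float) -> float:
    """Relative change in success rate: (improved - baseline) / baseline."""
    if baseline <= 0:
        raise ValueError("baseline success rate must be positive")
    return (improved - baseline) / baseline


# Stubs standing in for a real agent and a real VLM judge (illustrative only).
def stub_agent(task: str, feedback: Optional[str]) -> bytes:
    # Pretends to finish the task only after receiving feedback once.
    return b"done" if feedback is not None else b"draft"


def stub_judge(task: str, screenshot: bytes) -> Verdict:
    ok = screenshot == b"done"
    return Verdict(success=ok, feedback="" if ok else "the task is not finished")


if __name__ == "__main__":
    print(run_with_feedback("rename a file", stub_agent, stub_judge))
    # Illustrative numbers only, not results from the paper.
    print(relative_improvement(0.40, 0.50))
```

Note that a relative improvement is measured against the baseline rate, so with the illustrative numbers above a rise from 0.40 to 0.50 is a 25% relative gain even though the absolute gain is 10 percentage points.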

artificial intelligence, machine learning, natural language

Genre:
- Research Report (0.70)

Technology:
- Information Technology > Artificial Intelligence
  - Machine Learning (1.00)
  - Natural Language (1.00)
  - Representation & Reasoning > Agents (1.00)
