OS-ATLAS: A Foundation Action Model for Generalist GUI Agents

Zhiyong Wu, Zhenyu Wu, Fangzhi Xu, Yian Wang, Qiushi Sun, Chengyou Jia, Kanzhi Cheng, Zichen Ding, Liheng Chen, Paul Pu Liang, Yu Qiao

arXiv.org Artificial Intelligence 

Existing efforts in building GUI agents rely heavily on the availability of robust commercial Vision-Language Models (VLMs) such as GPT-4o and GeminiPro-Vision. Practitioners are often reluctant to use open-source VLMs because of their significant performance lag behind closed-source counterparts, particularly in GUI grounding and Out-Of-Distribution (OOD) scenarios. To facilitate future research in this area, we developed OS-Atlas -- a foundational GUI action model that excels at GUI grounding and OOD agentic tasks through innovations in both data and modeling. We have invested significant engineering effort in developing an open-source toolkit for synthesizing GUI grounding data across multiple platforms, including Windows, Linux, MacOS, Android, and the web. Leveraging this toolkit, we are releasing the largest open-source cross-platform GUI grounding corpus to date, containing over 13 million GUI elements. This dataset, combined with innovations in model training, provides a solid foundation for OS-Atlas to understand GUI screenshots and generalize to unseen interfaces. Through extensive evaluation across six benchmarks spanning three platforms (mobile, desktop, and web), OS-Atlas demonstrates significant performance improvements over previous state-of-the-art models.

With the recent adoption of large language models (LLMs), the fantasy of building digital agents (Wu et al., 2024) -- similar to JARVIS in Iron Man -- to automate daily tasks is evolving from science fiction into a tangible reality. Many current agents make decisions based on textual descriptions of their environments, such as HTML and accessibility trees, which are often lengthy (Zheng et al., 2024a), noisy (Cheng et al., 2024; WebAIM, 2024), and hard to acquire in practice. More recent studies (Cheng et al., 2024; Hong et al., 2024b; Li et al., 2024) have explored the use of large vision-language models (VLMs) to develop graphical user interface (GUI) agents capable of performing complex tasks simply by analyzing the screen -- an information-complete medium for the agent's decision-making -- allowing for greater flexibility. At the core of a GUI agent lies an action model that enables GUI grounding: the process of transforming natural language instructions into executable actions within the operating system (e.g., clicking somewhere on the screen).
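To make the grounding step concrete, the following is a minimal sketch of an instruction-to-click pipeline. The `model.generate` call, the "(x, y)" normalized-coordinate reply format, and the helper name `ground_instruction` are all assumptions for illustration; the actual OS-Atlas interface may differ.

```python
import re

import pyautogui  # executes the predicted click on the real screen


def ground_instruction(model, screenshot_path: str, instruction: str) -> None:
    """Turn a natural-language instruction into an executable click.

    `model.generate` stands in for a hypothetical VLM inference call;
    the reply format "(x, y)" with normalized coordinates is likewise
    an assumption, not the paper's specified output format.
    """
    prompt = f'In this UI screenshot, where should I click to: "{instruction}"?'
    reply = model.generate(image=screenshot_path, text=prompt)

    # Parse a reply such as "(0.42, 0.17)" into normalized coordinates.
    match = re.search(r"\(\s*([\d.]+)\s*,\s*([\d.]+)\s*\)", reply)
    if match is None:
        raise ValueError(f"model reply contained no coordinates: {reply!r}")
    x, y = float(match.group(1)), float(match.group(2))

    # Scale to screen pixels and perform the action in the operating system.
    width, height = pyautogui.size()
    pyautogui.click(int(x * width), int(y * height))
```

The key point is the division of labor: the action model only resolves language to a screen location, while a thin executor layer turns that location into an OS-level event.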
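The cross-platform grounding corpus described above pairs screenshots with element labels and locations. On the web, one plausible way to harvest such pairs is to walk the DOM for interactable elements and record their visible text and bounding boxes; the sketch below does this with Playwright. It illustrates the general idea only and is not the paper's released toolkit.

```python
from playwright.sync_api import sync_playwright


def collect_grounding_pairs(url: str) -> list[tuple[str, dict]]:
    """Collect (referring text, bounding box) pairs from one web page.

    A minimal sketch of web-side grounding-data synthesis under the
    assumption that an element's visible text can serve as a referring
    instruction for its on-screen location.
    """
    pairs = []
    with sync_playwright() as p:
        browser = p.chromium.launch()
        page = browser.new_page()
        page.goto(url)
        page.screenshot(path="screen.png")  # the image half of each pair
        # Interactable elements whose visible text can act as an instruction.
        for el in page.query_selector_all("a, button, input, [role=button]"):
            box = el.bounding_box()  # {'x', 'y', 'width', 'height'} or None
            text = (el.inner_text() or "").strip()
            if box and text:
                pairs.append((text, box))
        browser.close()
    return pairs
```

Desktop and mobile platforms would need different element sources (e.g., accessibility APIs rather than a DOM), which is what makes a unified cross-platform corpus an engineering effort in its own right.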
