OS-Copilot: Towards Generalist Computer Agents with Self-Improvement

Zhiyong Wu, Chengcheng Han, Zichen Ding, Zhenmin Weng, Zhoumianze Liu, Shunyu Yao, Tao Yu, Lingpeng Kong

arXiv.org Artificial Intelligence 

Figure 1: Running examples of FRIDAY when deployed on macOS and tasked with (1) preparing a focused working environment, (2) calculating and drawing a chart in Excel, and (3) creating a website for OS-Copilot. The text at the bottom lists the subtasks carried out by FRIDAY. For each set of examples, the figure at the top shows the initial OS state, while the one at the bottom shows the final state after execution. Boxes and ovals highlight the changes made by FRIDAY.

Autonomous interaction with the computer has been a longstanding challenge with great potential, and the recent proliferation of large language models (LLMs) has markedly accelerated progress in building digital agents. However, most of these agents are designed to interact with a narrow domain, such as a specific piece of software or a single website. This narrow focus constrains their applicability to general computer tasks. To this end, we introduce OS-Copilot, a framework for building generalist agents capable of interfacing with comprehensive elements in an operating system (OS), including the web, code terminals, files, multimedia, and various third-party applications. We use OS-Copilot to create FRIDAY, a self-improving embodied agent for automating general computer tasks. On GAIA, a general AI assistants benchmark, FRIDAY outperforms previous methods by 35%, showcasing strong generalization to unseen applications via skills accumulated from previous tasks. We also present numerical and qualitative evidence that FRIDAY learns to control and self-improve on Excel and PowerPoint with minimal supervision. Our OS-Copilot framework and empirical findings provide infrastructure and insights for future research toward more capable and general-purpose computer agents.

From the 1920 play R.U.R. to characters like JARVIS in Iron Man, people have dreamed throughout the past century of building digital agents to automate daily work.
However, current digital agents, like Microsoft's Cortana, are primarily tailored for simple tasks such as setting an alarm and struggle with complex human requests. Fortunately, advancements in large language models (LLMs) bring us closer to realizing the next generation of digital assistants. Efforts in building language agents (integrating LLMs into digital agents) have focused primarily on specific standalone applications, such as web browsers (Deng et al., 2023; Zhou et al., 2023), command-line terminals (Yang et al., 2023a; Qiao et al., 2023), the Minecraft game (Wang et al., 2023a), and databases (Hu et al., 2023). Notably, there has been little exploration of language agents that can effectively interact with the entire operating system.