 Ding, Zichen


OS-Genesis: Automating GUI Agent Trajectory Construction via Reverse Task Synthesis

arXiv.org Artificial Intelligence

Graphical User Interface (GUI) agents powered by Vision-Language Models (VLMs) have demonstrated human-like computer control capability. Despite their utility in advancing digital automation, a critical bottleneck persists: collecting high-quality trajectory data for training. Common practices for collecting such data rely on human supervision or synthetic data generation through executing pre-defined tasks, which are either resource-intensive or unable to guarantee data quality. Moreover, these methods suffer from limited data diversity and significant gaps between synthetic data and real-world environments. To address these challenges, we propose OS-Genesis, a novel GUI data synthesis pipeline that reverses the conventional trajectory collection process. Instead of relying on pre-defined tasks, OS-Genesis enables agents to first perceive environments and perform step-wise interactions, then retrospectively derive high-quality tasks to enable trajectory-level exploration. A trajectory reward model is then employed to ensure the quality of the generated trajectories. We demonstrate that training GUI agents with OS-Genesis significantly improves their performance on highly challenging online benchmarks. In-depth analysis further validates OS-Genesis's efficiency and its superior data quality and diversity compared to existing synthesis methods. Our code, data, and checkpoints are available at the OS-Genesis Homepage (https://qiushisun.github.io/OS-Genesis-Home/).
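The reverse pipeline described in the abstract (explore first, derive the task afterward, then filter with a reward model) can be summarized in a short sketch. All names below (env, propose_action, derive_task, score) are hypothetical stand-ins for illustration, not the released OS-Genesis interfaces.

```python
# Minimal sketch of reverse task synthesis, assuming hypothetical
# env / VLM / reward-model objects; not the OS-Genesis API.

def synthesize_trajectories(env, vlm, reward_model, n_episodes=100, threshold=0.7):
    dataset = []
    for _ in range(n_episodes):
        state = env.reset()
        steps = []
        while not env.done():
            action = vlm.propose_action(state)        # task-free, step-wise interaction
            next_state = env.execute(action)
            steps.append((state, action, next_state))
            state = next_state
        # Retrospectively derive a high-level instruction from what happened
        instruction = vlm.derive_task(steps)
        score = reward_model.score(instruction, steps)  # trajectory reward model
        if score >= threshold:                          # keep only high-quality trajectories
            dataset.append({"instruction": instruction,
                            "trajectory": steps,
                            "reward": score})
    return dataset
```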


SEAGraph: Unveiling the Whole Story of Paper Review Comments

arXiv.org Artificial Intelligence

Peer review, as a cornerstone of scientific research, ensures the integrity and quality of scholarly work by providing authors with objective feedback for refinement. However, in the traditional peer review process, authors often receive vague or insufficiently detailed feedback, which offers limited assistance and prolongs the review cycle. If authors can identify the specific weaknesses in their paper, they can not only address the reviewer's concerns but also improve their work. This raises the critical question of how to enhance authors' comprehension of review comments. In this paper, we present SEAGraph, a novel framework developed to clarify review comments by uncovering the underlying intentions behind them. We construct two types of graphs for each paper: the semantic mind graph, which captures the author's thought process, and the hierarchical background graph, which delineates the research domains related to the paper. A retrieval method is then designed to extract relevant content from both graphs, facilitating coherent explanations for the review comments. Extensive experiments show that SEAGraph excels in review comment understanding tasks, offering significant benefits to authors.
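A minimal sketch of the retrieve-then-explain step, assuming hypothetical graph objects with a top_k similarity lookup and a generic llm callable; the actual SEAGraph construction and retrieval are more involved.

```python
# Illustrative retrieval over the two graphs described above.
# semantic_mind_graph / background_graph / llm are assumed interfaces.

def explain_comment(comment, semantic_mind_graph, background_graph, llm, k=5):
    # Pull the k nodes most relevant to the review comment from each graph
    thoughts = semantic_mind_graph.top_k(comment, k)   # author's reasoning steps
    context = background_graph.top_k(comment, k)       # related research domains
    prompt = (
        f"Review comment: {comment}\n"
        f"Relevant paper reasoning: {thoughts}\n"
        f"Related background: {context}\n"
        "Explain the underlying intention behind this comment."
    )
    return llm(prompt)
```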


OS-ATLAS: A Foundation Action Model for Generalist GUI Agents

arXiv.org Artificial Intelligence

Existing efforts in building GUI agents heavily rely on the availability of robust commercial Vision-Language Models (VLMs) such as GPT-4o and GeminiPro-Vision. Practitioners are often reluctant to use open-source VLMs due to their significant performance lag compared to their closed-source counterparts, particularly in GUI grounding and Out-Of-Distribution (OOD) scenarios. To facilitate future research in this area, we developed OS-Atlas, a foundational GUI action model that excels at GUI grounding and OOD agentic tasks through innovations in both data and modeling. We have invested significant engineering effort in developing an open-source toolkit for synthesizing GUI grounding data across multiple platforms, including Windows, Linux, MacOS, Android, and the web. Leveraging this toolkit, we are releasing the largest open-source cross-platform GUI grounding corpus to date, which contains over 13 million GUI elements. This dataset, combined with innovations in model training, provides a solid foundation for OS-Atlas to understand GUI screenshots and generalize to unseen interfaces. Through extensive evaluation across six benchmarks spanning three different platforms (mobile, desktop, and web), OS-Atlas demonstrates significant performance improvements over previous state-of-the-art models.

With the recent adoption of large language models (LLMs), the fantasy of building digital agents (Wu et al., 2024), similar to JARVIS in Iron Man, to automate daily tasks is evolving from science fiction into a tangible reality. Many current agents make decisions based on textual descriptions of the environments, such as HTML and accessibility trees, which are often lengthy (Zheng et al., 2024a), noisy (Cheng et al., 2024; WebAIM, 2024), and hard to acquire in practice. More recent studies (Cheng et al., 2024; Hong et al., 2024b; Li et al., 2024) have explored the use of large vision-language models (VLMs) to develop graphical user interface (GUI) agents capable of performing complex tasks simply by analyzing the screen, an information-complete medium for the agent's decision-making, allowing for greater flexibility. At the core of a GUI agent lies an action model that enables GUI grounding: the process of transforming natural language instructions into executable actions within the operating system (e.g., clicking somewhere on the screen).
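GUI grounding, as defined at the end of the passage, reduces to mapping an instruction plus a screenshot to screen coordinates and executing the action. The sketch below assumes a hypothetical model.ground call that returns a normalized bounding box, and uses pyautogui only to illustrate execution; this is not OS-Atlas's actual interface.

```python
# Hedged sketch of GUI grounding: instruction + screenshot -> executable click.
# model.ground is an assumed interface; screenshot is a PIL Image.
import pyautogui

def ground_and_act(model, screenshot, instruction):
    # Assumed: the action model returns a normalized box for the target element
    x0, y0, x1, y1 = model.ground(image=screenshot, text=instruction)
    width, height = screenshot.size
    cx = (x0 + x1) / 2 * width    # convert normalized coordinates to pixels
    cy = (y0 + y1) / 2 * height
    pyautogui.click(cx, cy)       # execute the grounded action
    return cx, cy
```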


Let's Be Self-generated via Step by Step: A Curriculum Learning Approach to Automated Reasoning with Large Language Models

arXiv.org Artificial Intelligence

Prior to our efforts, there has already been work striving toward this goal. For example, Self-ICL (Chen et al., 2023) begins by prompting the LLM to generate new, diverse, and creative proxy queries tailored to the target task as few-shot examples, then solves each of them independently in a zero-shot chain-of-thought (ZS-CoT) manner, which in turn yields proxy exemplars for prompting the LLM to reason. Auto-ICL (Yang et al., 2023) operates similarly to Self-ICL, but differs in that it instructs the LLM to produce proxy queries that share the structure of the given query. Analogical Prompting (Yasunaga et al., 2023) draws on the cognitive process of solving new problems from relevant past experiences, i.e., analogical reasoning, and prompts the language model to self-generate relevant examples in context before embarking on the solution of a given query. Notably, the one-pass generation mode employed in Analogical Prompting requires that the LLM possess robust capabilities for both following instructions and generating responses. We revisit the aforementioned approaches and observe that their efficacy hinges on guiding the LLM to recall experiences relevant to the given query. However, considering only such experiences may lead to proxy queries that are as challenging as the given query, along with correspondingly erroneous proxy solutions, potentially misleading the solution of the original query.
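The shared pattern across Self-ICL and Auto-ICL (generate proxy queries, solve each with ZS-CoT, then use the pairs as exemplars) is compact enough to sketch. The prompts and the llm callable below are illustrative assumptions, not any paper's exact prompting templates.

```python
# Sketch of the self-generated in-context-learning pattern described above.
# `llm` is an assumed string -> string completion function.

ZS_COT = "Let's think step by step."

def self_generated_icl(llm, query, n_proxies=3):
    # 1. Ask the model for proxy queries similar to the target query
    raw = llm(f"Generate {n_proxies} new problems similar to:\n{query}")
    proxies = [p for p in raw.splitlines() if p.strip()][:n_proxies]
    # 2. Solve each proxy independently with zero-shot chain-of-thought
    exemplars = []
    for p in proxies:
        solution = llm(f"{p}\n{ZS_COT}")
        exemplars.append(f"Q: {p}\nA: {solution}")
    # 3. Prepend the self-generated exemplars when answering the real query
    prompt = "\n\n".join(exemplars) + f"\n\nQ: {query}\nA:"
    return llm(prompt)
```

Note how this sketch exposes the risk raised at the end of the passage: if the proxy queries are as hard as the original, step 2 can produce erroneous exemplars that then mislead step 3.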


Automated Peer Reviewing in Paper SEA: Standardization, Evaluation, and Analysis

arXiv.org Artificial Intelligence

In recent years, the rapid increase in scientific papers has overwhelmed traditional review mechanisms, resulting in publications of varying quality. Although existing methods have explored the capabilities of Large Language Models (LLMs) for automated scientific reviewing, the content they generate is often generic or partial. To address these issues, we introduce SEA, an automated paper reviewing framework. It comprises three modules: Standardization, Evaluation, and Analysis, represented by the models SEA-S, SEA-E, and SEA-A, respectively. Initially, SEA-S distills the data standardization capabilities of GPT-4 to integrate multiple reviews of a paper. Then, SEA-E uses the standardized data for fine-tuning, enabling it to generate constructive reviews. Finally, SEA-A introduces a new evaluation metric, the mismatch score, to assess the consistency between paper contents and reviews. Moreover, we design a self-correction strategy to enhance this consistency. Extensive experimental results on datasets collected from eight venues show that SEA can generate valuable insights that help authors improve their papers.
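One plausible reading of a mismatch score, and of the self-correction loop built on it, is an embedding-based measure of how well each review sentence is supported by the paper. The sketch below uses sentence-transformers for illustration; the actual SEA-A metric and correction strategy may be defined differently.

```python
# Illustrative mismatch score + self-correction loop, assuming a generic
# `llm` callable; not the SEA-A definition.
from sentence_transformers import SentenceTransformer, util

_model = SentenceTransformer("all-MiniLM-L6-v2")

def mismatch_score(paper_chunks, review_sentences):
    paper_emb = _model.encode(paper_chunks, convert_to_tensor=True)
    review_emb = _model.encode(review_sentences, convert_to_tensor=True)
    # For each review sentence, take its best-supported paper chunk
    support = util.cos_sim(review_emb, paper_emb).max(dim=1).values
    return float(1.0 - support.mean())  # higher = review less grounded in the paper

def self_correct(llm, paper_chunks, review, max_rounds=3, tol=0.4):
    # Regenerate the review until it is sufficiently consistent with the paper
    for _ in range(max_rounds):
        if mismatch_score(paper_chunks, review.split(". ")) <= tol:
            break
        review = llm(f"Revise this review so it better matches the paper:\n{review}")
    return review
```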


OS-Copilot: Towards Generalist Computer Agents with Self-Improvement

arXiv.org Artificial Intelligence

Figure 1: Running examples of FRIDAY when deployed on MacOS and tasked with (1) preparing a focused working environment, (2) calculating and drawing a chart in Excel, and (3) creating a website for OS-Copilot. The text at the bottom illustrates the subtasks taken by FRIDAY. For each set of examples, the figure at the top represents the initial OS state, while the one at the bottom depicts the final state after execution. Boxes/ovals highlight the changes made by FRIDAY.

Autonomous interaction with the computer has been a longstanding challenge with great potential, and the recent proliferation of large language models (LLMs) has markedly accelerated progress in building digital agents. However, most of these agents are designed to interact with a narrow domain, such as a specific software or website. This narrow focus constrains their applicability for general computer tasks. To address this, we introduce OS-Copilot, a framework for building generalist agents capable of interfacing with comprehensive elements in an operating system (OS), including the web, code terminals, files, multimedia, and various third-party applications. We use OS-Copilot to create FRIDAY, a self-improving embodied agent for automating general computer tasks. On GAIA, a general AI assistants benchmark, FRIDAY outperforms previous methods by 35%, showcasing strong generalization to unseen applications via skills accumulated from previous tasks. We also present quantitative evidence that FRIDAY learns to control and self-improve on Excel and PowerPoint with minimal supervision. Our OS-Copilot framework and empirical findings provide infrastructure and insights for future research toward more capable and general-purpose computer agents.

From the 1920 play R.U.R. to characters like JARVIS in Iron Man, people have dreamed throughout the past century of building digital agents to automate daily work. However, current digital agents, like Microsoft's Cortana, are primarily tailored to simple tasks like setting an alarm and struggle with complex human requests. Fortunately, advances in large language models (LLMs) bring us closer to realizing the next generation of digital assistants. Efforts in building language agents (LLMs integrated into digital agents) have focused primarily on specific standalone applications, such as web browsers (Deng et al., 2023; Zhou et al., 2023), command-line terminals (Yang et al., 2023a; Qiao et al., 2023), the Minecraft game (Wang et al., 2023a), and databases (Hu et al., 2023). In particular, there is a lack of exploration into language agents that can effectively interact with the entire operating system.
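The "accumulated skills" idea behind FRIDAY's self-improvement can be sketched as a plan-execute-store loop over a growing tool library. Every name here (SkillLibrary, plan, write_tool, refine, executor) is a hypothetical stand-in, not the OS-Copilot API.

```python
# Minimal sketch of a self-improving agent loop, under assumed interfaces.

class SkillLibrary:
    def __init__(self):
        self.skills = {}             # skill name -> stored tool code

    def retrieve(self, subtask):
        # Naive lookup for illustration; a real system would use retrieval
        return [code for name, code in self.skills.items() if name in subtask]

    def add(self, name, code):
        self.skills[name] = code     # accumulated for reuse on future tasks

def run_task(llm, executor, library, task):
    for subtask in llm.plan(task):                   # decompose the request
        hints = library.retrieve(subtask)            # reuse prior skills if any
        code = llm.write_tool(subtask, hints=hints)  # draft (or adapt) a tool
        ok, feedback = executor.run(code)
        if ok:
            library.add(subtask, code)               # the self-improvement step
        else:
            executor.run(llm.refine(code, feedback)) # one retry with critique
```

The design choice worth noting is that successful tool code is persisted, so generalization to unseen applications comes from the growing library rather than from retraining the underlying model.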