osworld
OSWorld: Benchmarking Multimodal Agents for Open-Ended Tasks in Real Computer Environments
Autonomous agents that accomplish complex computer tasks with minimal human interventions have the potential to transform human-computer interaction, significantly enhancing accessibility and productivity. However, existing benchmarks either lack an interactive environment or are limited to environments specific to certain applications or domains, failing to reflect the diverse and complex nature of real-world computer use, thereby limiting the scope of tasks and agent scalability. To address this issue, we introduce OSWorld, the first-of-its-kind scalable, real computer environment for multimodal agents, supporting task setup, execution-based evaluation, and interactive learning across various operating systems such as Ubuntu, Windows, and macOS. OSWorld can serve as a unified, integrated computer environment for assessing open-ended computer tasks that involve arbitrary applications. Building upon OSWorld, we create a benchmark of 369 computer tasks involving real web and desktop apps in open domains, OS file I/O, and workflows spanning multiple applications.
STEP: Success-Rate-Aware Trajectory-Efficient Policy Optimization
Chen, Yuhan, Liu, Yuxuan, Zhang, Long, Gao, Pengzhi, Luan, Jian, Liu, Wei
Multi-turn interaction remains challenging for online reinforcement learning. A common solution is trajectory-level optimization, which treats each trajectory as a single training sample. However, this approach can be inefficient and yield misleading learning signals: it applies uniform sampling across tasks regardless of difficulty, penalizes correct intermediate actions in failed trajectories, and incurs high sample-collection costs. To address these issues, we propose STEP (Success-rate-aware Trajectory-Efficient Policy optimization), a framework that dynamically allocates sampling based on per-task success rates and performs step-level optimization. STEP maintains a smoothed success-rate record to guide adaptive trajectory resampling, allocating more effort to harder tasks. It then computes success-rate-weighted advantages and decomposes trajectories into step-level samples. Finally, it applies a step-level GRPO augmentation to refine updates for low-success tasks. Experiments on OSWorld and AndroidWorld show that STEP substantially improves sample efficiency and training stability over trajectory-level GRPO, converging faster and generalizing better under the same sampling budget.
- Asia > Japan > Honshū > Chūbu > Toyama Prefecture > Toyama (0.04)
- Asia > China > Hubei Province > Wuhan (0.04)
- Workflow (0.68)
- Research Report (0.64)
Learning from Online Videos at Inference Time for Computer-Use Agents
Liu, Yujian, Wang, Ze, Chen, Hao, Sun, Ximeng, Yu, Xiaodong, Wu, Jialian, Liu, Jiang, Barsoum, Emad, Liu, Zicheng, Chang, Shiyu
Computer-use agents can operate computers and automate laborious tasks, but despite recent rapid progress, they still lag behind human users, especially when tasks require domain-specific procedural knowledge about particular applications, platforms, and multi-step workflows. Humans can bridge this gap by watching video tutorials: we search, skim, and selectively imitate short segments that match our current subgoal. In this paper, we study how to enable computer-use agents to learn from online videos at inference time effectively. We propose a framework that retrieves and filters tutorial videos, converts them into structured demonstration trajectories, and dynamically selects trajectories as in-context guidance during execution. Particularly, using a VLM, we infer UI actions, segment videos into short subsequences of actions, and assign each subsequence a textual objective. At inference time, a two-stage selection mechanism dynamically chooses a single trajectory to add in context at each step, focusing the agent on the most helpful local guidance for its next decision. Experiments on two widely used benchmarks show that our framework consistently outperforms strong base agents and variants that use only textual tutorials or transcripts. Analyses highlight the importance of trajectory segmentation and selection, action filtering, and visual information, suggesting that abundant online videos can be systematically distilled into actionable guidance that improves computer-use agents at inference time. Our code is available at https://github.com/UCSB-NLP-Chang/video_demo.
- Workflow (1.00)
- Research Report > New Finding (0.67)
- Instructional Material > Course Syllabus & Notes (0.46)
OS-Harm: A Benchmark for Measuring Safety of Computer Use Agents
Kuntz, Thomas, Duzan, Agatha, Zhao, Hao, Croce, Francesco, Kolter, Zico, Flammarion, Nicolas, Andriushchenko, Maksym
Computer use agents are LLM-based agents that can directly interact with a graphical user interface, by processing screenshots or accessibility trees. While these systems are gaining popularity, their safety has been largely overlooked, despite the fact that evaluating and understanding their potential for harmful behavior is essential for widespread adoption. To address this gap, we introduce OS-Harm, a new benchmark for measuring safety of computer use agents. OS-Harm is built on top of the OSWorld environment and aims to test models across three categories of harm: deliberate user misuse, prompt injection attacks, and model misbehavior. To cover these cases, we create 150 tasks that span several types of safety violations (harassment, copyright infringement, disinformation, data exfiltration, etc.) and require the agent to interact with a variety of OS applications (email client, code editor, browser, etc.). Moreover, we propose an automated judge to evaluate both accuracy and safety of agents that achieves high agreement with human annotations (0.76 and 0.79 F1 score). We evaluate computer use agents based on a range of frontier models - such as o4-mini, Claude 3.7 Sonnet, Gemini 2.5 Pro - and provide insights into their safety. In particular, all models tend to directly comply with many deliberate misuse queries, are relatively vulnerable to static prompt injections, and occasionally perform unsafe actions. The OS-Harm benchmark is available at https://github.com/tml-epfl/os-harm.
- Research Report > New Finding (1.00)
- Research Report > Experimental Study (1.00)
- Workflow (0.93)
- Law (1.00)
- Information Technology > Security & Privacy (1.00)
- Government > Regional Government > North America Government > United States Government (0.67)
R-WoM: Retrieval-augmented World Model For Computer-use Agents
Mei, Kai, Guo, Jiang, Chang, Shuaichen, Dong, Mingwen, Lee, Dongkyu, Niu, Xing, Jiang, Jiarong
Large Language Models (LLMs) can serve as world models to enhance agent decision-making in digital environments by simulating future states and predicting action outcomes, potentially eliminating costly trial-and-error exploration. However, this capability is fundamentally limited by LLM's tendency to hallucination and their reliance on static training knowledge, which could lead to compounding errors that inhibit long-horizon simulations. To systematically investigate whether LLMs are appropriate for world modeling, we probe two core capabilities of world models - future state prediction and reward estimation - through three tasks: next-state identification, full-procedure planning alignment, and milestone transition recognition. Our analysis shows that while LLMs effectively capture immediate next states and identify meaningful state transitions, their performance rapidly degrades in full-procedure planning. This highlights LLMs' limitations in reliably modeling environment dynamics over long horizons. To address these limitations, we propose the Retrieval-augmented World Model (R-WoM), which grounds LLM simulations by incorporating factual, up-to-date knowledge retrieved from external tutorials. Experiments show that R-WoM achieves substantial improvements of up to 25.3% (OSWorld) and 18.1% (WebArena) compared to baselines, with particular advantage in longer-horizon simulations. World models have evolved from early symbolic planning systems to sophisticated neural architectures that learn latent representations of environment dynamics. Model-based reinforcement learning (MBRL) approaches, such as Dreamer v1-3 (Hafner et al., 2019; 2020; 2023) and MuZero (Schrittwieser et al., 2020), learn latent world models to "imagine" trajectories before selecting actions. More recently, Large Language Model (LLM)-based world models (Hao et al., 2023; Wang et al., 2024; Zhang et al., 2024) have emerged as a new paradigm, leveraging large-scale pre-training to reason about action consequences in realistic digital environments.
- Research Report (1.00)
- Workflow (0.68)
- Instructional Material > Course Syllabus & Notes (0.46)
The Unreasonable Effectiveness of Scaling Agents for Computer Use
Gonzalez-Pumariega, Gonzalo, Tu, Vincent, Lee, Chih-Lun, Yang, Jiachen, Li, Ang, Wang, Xin Eric
Computer-use agents (CUAs) hold promise for automating everyday digital tasks, but their unreliability and high variance hinder their application to long-horizon, complex tasks. We introduce Behavior Best-of-N (bBoN), a method that scales over agents by generating multiple rollouts and selecting among them using behavior narratives that describe the agents' rollouts. It enables both wide exploration and principled trajectory selection, substantially improving robustness and success rates. On OSWorld, our bBoN scaling method establishes a new state of the art (SoTA) at 69.9%, significantly outperforming prior methods and approaching human-level performance at 72%, with comprehensive ablations validating key design choices. We further demonstrate strong generalization results to different operating systems on WindowsAgentArena and AndroidWorld. Crucially, our results highlight the unreasonable effectiveness of scaling CUAs, when you do it right: effective scaling requires structured trajectory understanding and selection, and bBoN provides a practical framework to achieve this.
- North America > United States > Louisiana > Orleans Parish > New Orleans (0.04)
- North America > Canada > British Columbia > Vancouver (0.04)
- Asia > Thailand > Bangkok > Bangkok (0.04)
- (2 more...)
- Workflow (0.93)
- Research Report > New Finding (0.48)
- Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.49)
- Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models > Undirected Networks > Markov Models (0.46)
- Information Technology > Artificial Intelligence > Representation & Reasoning > Agents > Agent Societies (0.34)
OSWorld: Benchmarking Multimodal Agents for Open-Ended Tasks in Real Computer Environments
Autonomous agents that accomplish complex computer tasks with minimal human interventions have the potential to transform human-computer interaction, significantly enhancing accessibility and productivity. However, existing benchmarks either lack an interactive environment or are limited to environments specific to certain applications or domains, failing to reflect the diverse and complex nature of real-world computer use, thereby limiting the scope of tasks and agent scalability. To address this issue, we introduce OSWorld, the first-of-its-kind scalable, real computer environment for multimodal agents, supporting task setup, execution-based evaluation, and interactive learning across various operating systems such as Ubuntu, Windows, and macOS. OSWorld can serve as a unified, integrated computer environment for assessing open-ended computer tasks that involve arbitrary applications. Building upon OSWorld, we create a benchmark of 369 computer tasks involving real web and desktop apps in open domains, OS file I/O, and workflows spanning multiple applications.
Meet The AI Agent With Multiple Personalities
In the coming years, agents are widely expected to take over more and more chores on behalf of humans, including using computers and smartphones. For now, though, they're too error prone to be much use. A new agent called S2, created by the startup Simular AI, combines frontier models with models specialized for using computers. The agent achieves state-of-the-art performance on tasks like using apps and manipulating files--and suggests that turning to different models in different situations may help agents advance. "Computer-using agents are different from large language models and different from coding," says Ang Li, cofounder and CEO of Simular. In Simular's approach, a powerful general-purpose AI model, like OpenAI's GPT-4o or Anthropic's Claude 3.7, is used to reason about how best to complete the task at hand--while smaller open source models step in for tasks like interpreting web pages.
Attacking Vision-Language Computer Agents via Pop-ups
Zhang, Yanzhe, Yu, Tao, Yang, Diyi
Autonomous agents powered by large vision and language models (VLM) have demonstrated significant potential in completing daily computer tasks, such as browsing the web to book travel and operating desktop software, which requires agents to understand these interfaces. Despite such visual inputs becoming more integrated into agentic applications, what types of risks and attacks exist around them still remain unclear. In this work, we demonstrate that VLM agents can be easily attacked by a set of carefully designed adversarial pop-ups, which human users would typically recognize and ignore. This distraction leads agents to click these pop-ups instead of performing the tasks as usual. Integrating these pop-ups into existing agent testing environments like OSWorld and VisualWebArena leads to an attack success rate (the frequency of the agent clicking the pop-ups) of 86% on average and decreases the task success rate by 47%. Basic defense techniques such as asking the agent to ignore pop-ups or including an advertisement notice, are ineffective against the attack.
- North America > United States > California > Santa Clara County > San Jose (0.04)
- North America > United States > Ohio > Franklin County > Columbus (0.04)
- North America > United States > New York > New York County > New York City (0.04)
- (2 more...)
- Information Technology > Security & Privacy (1.00)
- Government (0.68)
Agent S: An Open Agentic Framework that Uses Computers Like a Human
Agashe, Saaket, Han, Jiuzhou, Gan, Shuyu, Yang, Jiachen, Li, Ang, Wang, Xin Eric
We present Agent S, an open agentic framework that enables autonomous interaction with computers through a Graphical User Interface (GUI), aimed at transforming human-computer interaction by automating complex, multi-step tasks. Agent S aims to address three key challenges in automating computer tasks: acquiring domain-specific knowledge, planning over long task horizons, and handling dynamic, non-uniform interfaces. To this end, Agent S introduces experience-augmented hierarchical planning, which learns from external knowledge search and internal experience retrieval at multiple levels, facilitating efficient task planning and subtask execution. In addition, it employs an Agent-Computer Interface (ACI) to better elicit the reasoning and control capabilities of GUI agents based on Multimodal Large Language Models (MLLMs). Evaluation on the OSWorld benchmark shows that Agent S outperforms the baseline by 9.37% on success rate (an 83.6% relative improvement) and achieves a new state-of-the-art. Comprehensive analysis highlights the effectiveness of individual components and provides insights for future improvements. Furthermore, Agent S demonstrates broad generalizability to different operating systems on a newly-released WindowsAgentArena benchmark. Code available at https://github.com/simular-ai/Agent-S.
- Europe > Austria > Vienna (0.14)
- North America > United States > California > San Francisco County > San Francisco (0.14)
- North America > United States > Louisiana > Orleans Parish > New Orleans (0.05)
- (10 more...)
- Workflow (0.68)
- Research Report (0.64)
- Information Technology > Human Computer Interaction > Interfaces (1.00)
- Information Technology > Artificial Intelligence > Representation & Reasoning > Agents (1.00)
- Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.47)