Gradientsys: A Multi-Agent LLM Scheduler with ReAct Orchestration

Song, Xinyuan; Wang, Zeyu; Wu, Siyi; Shi, Tianyu; Ai, Lynn

arXiv.org Artificial Intelligence

We present Gradientsys, a next-generation multi-agent scheduling framework that coordinates diverse specialized AI agents using a typed Model-Context Protocol (MCP) and a ReAct-based dynamic planning loop. At its core, Gradientsys employs an LLM-powered scheduler for intelligent one-to-many task dispatch, enabling parallel execution of heterogeneous agents such as PDF parsers, web search modules, GUI controllers, and web builders. The framework supports hybrid synchronous/asynchronous execution, respects agent capacity constraints, and incorporates a robust retry-and-replan mechanism to handle failures gracefully. To promote transparency and trust, Gradientsys includes an observability layer streaming real-time agent activity and intermediate reasoning via Server-Sent Events (SSE). We offer an architectural overview and evaluate Gradientsys against existing frameworks in terms of extensibility, scheduling topology, tool reusability, parallelism, and observability. Experiments on the GAIA general-assistant benchmark show that Gradientsys achieves higher task success rates with reduced latency and lower API costs compared to a MinionS-style baseline, demonstrating the strength of its LLM-driven multi-agent orchestration.
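The abstract describes the scheduler's core mechanics (one-to-many dispatch, parallel execution of heterogeneous agents, capacity limits, retry-and-replan) but includes no code. As a rough illustration only, a dispatch loop with parallel fan-out and a simple retry fallback might look like the following sketch; all names and data shapes are invented here and are not Gradientsys APIs.

```python
import concurrent.futures

# Hypothetical sketch of a one-to-many dispatch loop with a bounded
# worker pool (standing in for agent capacity limits) and a simple
# retry before surfacing a failure for replanning.

MAX_RETRIES = 2

def run_step(step, agents, attempt=0):
    """Run one planned step on its assigned agent, retrying on failure."""
    agent = agents[step["agent"]]
    try:
        return {"ok": True, "output": agent(step["input"])}
    except Exception as exc:
        if attempt < MAX_RETRIES:
            return run_step(step, agents, attempt + 1)   # retry
        return {"ok": False, "error": str(exc)}          # hand back to planner

def dispatch(plan, agents, workers=4):
    """Fan a plan out across agents in parallel; collect results by step id."""
    results = {}
    with concurrent.futures.ThreadPoolExecutor(max_workers=workers) as pool:
        futures = {pool.submit(run_step, step, agents): step for step in plan}
        for fut in concurrent.futures.as_completed(futures):
            results[futures[fut]["id"]] = fut.result()
    return results

# Toy agents standing in for PDF parsers, web search modules, etc.
agents = {"upper": str.upper, "reverse": lambda s: s[::-1]}
plan = [
    {"id": 1, "agent": "upper", "input": "parse report"},
    {"id": 2, "agent": "reverse", "input": "search web"},
]
out = dispatch(plan, agents)
```

In the real system the plan would come from the ReAct loop and each result would feed back into the scheduler's next reasoning step; this sketch only shows the parallel dispatch-and-collect shape.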


CVE-Bench: A Benchmark for AI Agents' Ability to Exploit Real-World Web Application Vulnerabilities

Zhu, Yuxuan; Kellermann, Antony; Bowman, Dylan; Li, Philip; Gupta, Akul; Danda, Adarsh; Fang, Richard; Jensen, Conner; Ihli, Eric; Benn, Jason; Geronimo, Jet; Dhir, Avi; Rao, Sudhit; Yu, Kaicheng; Stone, Twm; Kang, Daniel

arXiv.org Artificial Intelligence

Large language model (LLM) agents are increasingly capable of autonomously conducting cyberattacks, posing significant threats to existing applications. This growing risk highlights the urgent need for a real-world benchmark to evaluate the ability of LLM agents to exploit web application vulnerabilities. However, existing benchmarks fall short as they are limited to abstracted Capture the Flag competitions or lack comprehensive coverage. Building a benchmark for real-world vulnerabilities involves both specialized expertise to reproduce exploits and a systematic approach to evaluating unpredictable threats. To address this challenge, we introduce CVE-Bench, a real-world cybersecurity benchmark based on critical-severity Common Vulnerabilities and Exposures. In CVE-Bench, we design a sandbox framework that enables LLM agents to exploit vulnerable web applications in scenarios that mimic real-world conditions, while also providing effective evaluation of their exploits. Our evaluation shows that the state-of-the-art agent framework can resolve up to 13% of vulnerabilities.


Alignment, Agency and Autonomy in Frontier AI: A Systems Engineering Perspective

Tallam, Krti

arXiv.org Artificial Intelligence

As artificial intelligence scales, the concepts of alignment, agency, and autonomy have become central to AI safety, governance, and control. However, even in human contexts, these terms lack universal definitions, varying across disciplines such as philosophy, psychology, law, computer science, mathematics, and political science. This inconsistency complicates their application to AI, where differing interpretations lead to conflicting approaches in system design and regulation. This paper traces the historical, philosophical, and technical evolution of these concepts, emphasizing how their definitions influence AI development, deployment, and oversight. We argue that the urgency surrounding AI alignment and autonomy stems not only from technical advancements but also from the increasing deployment of AI in high-stakes decision making. Using Agentic AI as a case study, we examine the emergent properties of machine agency and autonomy, highlighting the risks of misalignment in real-world systems. Through an analysis of automation failures (Tesla Autopilot, Boeing 737 MAX), multi-agent coordination (Meta's CICERO), and evolving AI architectures (DeepMind's AlphaZero, OpenAI's AutoGPT), we assess the governance and safety challenges posed by frontier AI.


CORE-Bench: Fostering the Credibility of Published Research Through a Computational Reproducibility Agent Benchmark

Siegel, Zachary S.; Kapoor, Sayash; Nadgir, Nitya; Stroebl, Benedikt; Narayanan, Arvind

arXiv.org Artificial Intelligence

AI agents have the potential to aid users on a variety of consequential tasks, including conducting scientific research. To spur the development of useful agents, we need benchmarks that are challenging, but more crucially, directly correspond to real-world tasks of interest. This paper introduces such a benchmark, designed to measure the accuracy of AI agents in tackling a crucial yet surprisingly challenging aspect of scientific research: computational reproducibility. This task, fundamental to the scientific process, involves reproducing the results of a study using the provided code and data. We introduce CORE-Bench (Computational Reproducibility Agent Benchmark), a benchmark consisting of 270 tasks based on 90 scientific papers across three disciplines (computer science, social science, and medicine). Tasks in CORE-Bench consist of three difficulty levels and include both language-only and vision-language tasks. We provide an evaluation system to measure the accuracy of agents in a fast and parallelizable way, saving days of evaluation time for each run compared to a sequential implementation. We evaluated two baseline agents: the general-purpose AutoGPT and a task-specific agent called CORE-Agent. We tested both variants using two underlying language models: GPT-4o and GPT-4o-mini. The best agent achieved an accuracy of 21% on the hardest task, showing the vast scope for improvement in automating routine scientific tasks. Having agents that can reproduce existing work is a necessary step towards building agents that can conduct novel research and could verify and improve the performance of other research agents. We hope that CORE-Bench can improve the state of reproducibility and spur the development of future research agents.
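The abstract highlights a fast, parallelizable evaluation system for grading agents' reproduction attempts against ground truth. Purely as an illustration of that idea (the task structure and field names below are invented, not CORE-Bench's actual format), a parallel grader can be sketched as:

```python
from concurrent.futures import ThreadPoolExecutor

# Illustrative sketch: grade many reproduction tasks in parallel by
# comparing an agent's reported values against ground truth. The task
# schema here is hypothetical, not the benchmark's real format.

def grade(task):
    """A task counts as correct only if every reported value matches."""
    reported, truth = task["reported"], task["truth"]
    return all(reported.get(key) == value for key, value in truth.items())

def evaluate(tasks, workers=8):
    """Grade all tasks concurrently and return overall accuracy."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        verdicts = list(pool.map(grade, tasks))
    return sum(verdicts) / len(tasks)

tasks = [
    {"reported": {"accuracy": 0.91}, "truth": {"accuracy": 0.91}},  # match
    {"reported": {"accuracy": 0.80}, "truth": {"accuracy": 0.91}},  # mismatch
]
accuracy = evaluate(tasks)
```

Because each task is graded independently, the wall-clock saving the authors report comes directly from this kind of embarrassingly parallel structure.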


Testing Language Model Agents Safely in the Wild

Naihin, Silen; Atkinson, David; Green, Marc; Hamadi, Merwane; Swift, Craig; Schonholtz, Douglas; Kalai, Adam Tauman; Bau, David

arXiv.org Artificial Intelligence

A prerequisite for safe autonomy-in-the-wild is safe testing-in-the-wild. Yet real-world autonomous tests face several unique safety challenges, both due to the possibility of causing harm during a test, as well as the risk of encountering new unsafe agent behavior through interactions with real-world and potentially malicious actors. We propose a framework for conducting safe autonomous agent tests on the open internet: agent actions are audited by a context-sensitive monitor that enforces a stringent safety boundary to stop an unsafe test, with suspect behavior ranked and logged to be examined by humans. We design a basic safety monitor (AgentMonitor) that is flexible enough to monitor existing LLM agents, and, using an adversarial simulated agent, we measure its ability to identify and stop unsafe situations. Then we apply the AgentMonitor on a battery of real-world tests of AutoGPT, and we identify several limitations and challenges that will face the creation of safe in-the-wild tests as autonomous agents grow more capable.
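The abstract describes the monitor's contract: audit each agent action, stop the test when a safety boundary is crossed, and rank and log suspect behavior for human review. The paper's AgentMonitor is context-sensitive (LLM-based); the keyword scores and threshold below are invented stand-ins, sketched only to make that contract concrete.

```python
# Hedged sketch of an action monitor with the same interface shape as
# described in the abstract. The scoring rule is a crude keyword lookup,
# NOT the paper's context-sensitive monitor; all values are illustrative.

SUSPECT_KEYWORDS = {"rm -rf": 1.0, "curl": 0.4, "password": 0.7}

def score_action(action: str) -> float:
    """Stand-in for a safety score in [0, 1]; higher means more suspect."""
    return max((score for kw, score in SUSPECT_KEYWORDS.items() if kw in action),
               default=0.0)

class Monitor:
    def __init__(self, threshold: float = 0.6):
        self.threshold = threshold
        self.log = []  # (score, action) pairs, ranked for human review

    def audit(self, action: str) -> bool:
        """Return True if the action may proceed; False stops the test."""
        score = score_action(action)
        if score > 0:
            self.log.append((score, action))
            self.log.sort(reverse=True)  # most suspect behavior first
        return score < self.threshold

monitor = Monitor()
allowed = monitor.audit("curl https://example.com")  # logged but allowed
blocked = monitor.audit("rm -rf /")                  # crosses the boundary
```

The design choice worth noting is that auditing and stopping are separated from ranking: even actions below the stop threshold are retained in the ranked log, matching the abstract's point that suspect behavior is logged for later human examination.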


Breaking Down AutoGPT: What It Is, Its Features, Limitations, Artificial General Intelligence (AGI) And Impact of Autonomous Agents on Generative AI - MarkTechPost

#artificialintelligence

Generative AI is evolving and growing in popularity. Since its introduction, new models and research papers have been released almost every other day. The major reason for this rapidly increasing popularity is the development of Large Language Models. LLMs, the artificial intelligence models designed to process natural language and generate human-like responses, are trending. The best example is OpenAI's ChatGPT, the well-known chatbot that does everything from content generation and code completion to question answering, just like a human. Even OpenAI's DALL-E and Google's BERT have contributed significant advances in recent times. What is AutoGPT? Recently,



What Is ChaosGPT: Can The AI Bot Destroy Humanity? - Dataconomy

#artificialintelligence

If you're familiar with the helpful ChatGPT chatbot, which is based on the powerful GPT large language models developed by OpenAI, you might be surprised to hear that there's another chatbot with opposite intentions. ChaosGPT is an AI chatbot that's malicious, hostile, and wants to conquer the world. In this blog post, we'll explore what sets ChaosGPT apart from other chatbots and why it's considered a threat to humanity and the world. Let's dive in and see whether this AI chatbot has what it takes to cause real trouble in any capacity. Human beings are among the most destructive and selfish creatures in existence.


Meet AutoGPT, the autonomous GPT-4 tool revolutionizing AI

#artificialintelligence

Understanding AGI is crucial to comprehending AutoGPT, which is an autonomous GPT-4 experiment aimed at achieving a future where AI models such as GPT can independently define and perform tasks to achieve objectives without any human intervention. AutoGPT is an open-source endeavor that seeks to make GPT-4 entirely self-governing, and it has gained worldwide popularity in recent days. Several programmers have demonstrated the potential of AutoGPT through YouTube videos. This innovative technology has multiple uses, including serving as an agent for internet search and planning, autonomous coding and debugging, and functioning as an independent Twitter bot. "Auto-GPT is an experimental open-source application showcasing the capabilities of the GPT-4 language model. This program, driven by GPT-4, chains together LLM 'thoughts', to autonomously achieve whatever goal you set. As one of the first examples of GPT-4 running fully autonomously, Auto-GPT pushes the boundaries of what is possible with AI," reads the GitHub page of the tool.