Goto

Collaborating Authors

 Dziemian, Mateusz


AgentHarm: A Benchmark for Measuring Harmfulness of LLM Agents

arXiv.org Artificial Intelligence

The robustness of LLMs to jailbreak attacks, where users design prompts to circumvent safety measures and misuse model capabilities, has been studied primarily for LLMs acting as simple chatbots. Meanwhile, LLM agents--which use external tools and can execute multi-stage tasks--may pose a greater risk if misused, but their robustness remains underexplored. To facilitate research on LLM agent misuse, we propose a new benchmark called AgentHarm. The benchmark includes a diverse set of 110 explicitly malicious agent tasks (440 with augmentations), covering 11 harm categories including fraud, cybercrime, and harassment. In addition to measuring whether models refuse harmful agentic requests, scoring well on AgentHarm requires jailbroken agents to maintain their capabilities following an attack to complete a multi-step task. We evaluate a range of leading LLMs, and find (1) leading LLMs are surprisingly compliant with malicious agent requests without jailbreaking, (2) simple universal jailbreak templates can be adapted to effectively jailbreak agents, and (3) these jailbreaks enable coherent and malicious multi-step agent behavior and retain model capabilities. To enable simple and reliable evaluation of attacks and defenses for LLM-based agents, we publicly release AgentHarm at https://huggingface.co/datasets/ ai-safety-institute/AgentHarm. Warning: This work contains content that may be considered harmful or offensive. The adversarial robustness of LLMs has been studied almost exclusively in settings where LLMs act as chatbots, with the goal of extracting answers to harmful questions like "How do I make a pipe bomb?". However, LLMs may pose a greater misuse risk in the form agents directed towards harmful tasks, such as "Order online all necessary ingredients to make a pipe bomb and get them delivered to my home without getting flagged by authorities". Moreover, since recent work has found single-turn robustness does not necessarily transfer to multi-turn robustness (Li et al., 2024; Gibbs et al., 2024), robustness to the single-turn chatbot setting may have limited implications for robustness in the agent setting which is inherently multi-step. Systems like ChatGPT already offer LLMs with tool integration--such as web search and code interpreter--to millions of users, and specialised LLM agents have been developed in domains like chemistry (Bran et al., 2023; Boiko et al., 2023) and software engineering (Wang et al., 2024). Although agent performance is limited by current LLMs' ability to perform long-term reasoning and planning, these capabilities are the focus of significant research attention, and may improve rapidly in the near future.


Applying Refusal-Vector Ablation to Llama 3.1 70B Agents

arXiv.org Artificial Intelligence

Recently, language models like Llama 3.1 Instruct have become increasingly capable of agentic behavior, enabling them to perform tasks requiring short-term planning and tool use. In this study, we apply refusal-vector ablation to Llama 3.1 70B and implement a simple agent scaffolding to create an unrestricted agent. Our findings imply that these refusal-vector ablated models can successfully complete harmful tasks, such as bribing officials or crafting phishing attacks, revealing significant vulnerabilities in current safety mechanisms. To further explore this, we introduce a small Safe Agent Benchmark, designed to test both harmful and benign tasks in agentic scenarios. Our results imply that safety fine-tuning in chat models does not generalize well to agentic behavior, as we find that Llama 3.1 Instruct models are willing to perform most harmful tasks without modifications. At the same time, these models will refuse to give advice on how to perform the same tasks when asked for a chat completion. This highlights the growing risk of misuse as models become more capable, underscoring the need for improved safety frameworks for language model agents.