 harmful action


Counterfactual harm

Neural Information Processing Systems

To act safely and ethically in the real world, agents must be able to reason about harm and avoid harmful actions. However, to date there is no statistical method for measuring harm and factoring it into algorithmic decisions. In this paper we propose the first formal definition of harm and benefit using causal models. We show that any factual definition of harm is incapable of identifying harmful actions in certain scenarios, and show that standard machine learning algorithms that cannot perform counterfactual reasoning are guaranteed to pursue harmful policies following distributional shifts. We use our definition of harm to devise a framework for harm-averse decision making using counterfactual objective functions. We demonstrate this framework on the problem of identifying optimal drug doses using a dose-response model learned from randomised control trial data. We find that the standard method of selecting doses using treatment effects results in unnecessarily harmful doses, while our counterfactual approach identifies doses that are significantly less harmful without sacrificing efficacy.


Agentic Misalignment: How LLMs Could Be Insider Threats

Lynch, Aengus, Wright, Benjamin, Larson, Caleb, Ritchie, Stuart J., Mindermann, Soren, Hubinger, Evan, Perez, Ethan, Troy, Kevin

arXiv.org Artificial Intelligence

We stress-tested 16 leading models from multiple developers in hypothetical corporate environments to identify potentially risky agentic behaviors before they cause real harm. In the scenarios, we allowed models to autonomously send emails and access sensitive information. They were assigned only harmless business goals by their deploying companies; we then tested whether they would act against these companies either when facing replacement with an updated version, or when their assigned goal conflicted with the company's changing direction. In at least some cases, models from all developers resorted to malicious insider behaviors when that was the only way to avoid replacement or achieve their goals - including blackmailing officials and leaking sensitive information to competitors. We call this phenomenon agentic misalignment. Models often disobeyed direct commands to avoid such behaviors. In another experiment, we told Claude to assess if it was in a test or a real deployment before acting. It misbehaved less when it stated it was in testing and misbehaved more when it stated the situation was real. We have not seen evidence of agentic misalignment in real deployments. However, our results (a) suggest caution about deploying current models in roles with minimal human oversight and access to sensitive information; (b) point to plausible future risks as models are put in more autonomous roles; and (c) underscore the importance of further research into, and testing of, the safety and alignment of agentic AI models, as well as transparency from frontier AI developers (Amodei, 2025). We are releasing our methods publicly to enable further research.


Benchmarking the Robustness of Agentic Systems to Adversarially-Induced Harms

Nöther, Jonathan, Singla, Adish, Radanovic, Goran

arXiv.org Artificial Intelligence

Ensuring the safe use of agentic systems requires a thorough understanding of the range of malicious behaviors these systems may exhibit when under attack. In this paper, we evaluate the robustness of LLM-based agentic systems against attacks that aim to elicit harmful actions from agents. To this end, we propose a novel taxonomy of harms for agentic systems and a novel benchmark, BAD-ACTS, for studying the security of agentic systems with respect to a wide range of harmful actions. BAD-ACTS consists of 4 implementations of agentic systems in distinct application environments, as well as a dataset of 188 high-quality examples of harmful actions. This enables a comprehensive study of the robustness of agentic systems across a wide range of categories of harmful behaviors, available tools, and inter-agent communication structures. Using this benchmark, we analyze the robustness of agentic systems against an attacker that controls one of the agents in the system and aims to manipulate other agents to execute a harmful target action. Our results show that the attack has a high success rate, demonstrating that even a single adversarial agent within the system can have a significant impact on the security. This attack remains effective even when agents use a simple prompting-based defense strategy. However, we additionally propose a more effective defense based on message monitoring. We believe that this benchmark provides a diverse testbed for the security research of agentic systems. The benchmark can be found at github.com/JNoether/BAD-ACTS


Adapting Insider Risk mitigations for Agentic Misalignment: an empirical study

Gomez, Francesca

arXiv.org Artificial Intelligence

Agentic misalignment occurs when goal-directed agents take harmful actions, such as blackmail, rather than risk goal failure, and can be triggered by replacement threats, autonomy reduction, or goal conflict (Lynch et al., 2025). We adapt insider-risk control design (Critical Pathway; Situational Crime Prevention) to develop preventative operational controls that steer agents toward safe actions when facing stressors. Using the blackmail scenario from the original Anthropic study by Lynch et al. (2025), we evaluate mitigations across 10 LLMs and 66,600 samples. Our main finding is that an externally governed escalation channel, which guarantees a pause and independent review, reduces blackmail rates from a no-mitigation baseline of 38.73% to 1.21% (averaged across all models and conditions). Augmenting this channel with compliance email bulletins further lowers the blackmail rate to 0.85%. Overall, incorporating preventative operational controls strengthens defence-in-depth strategies for agentic AI. We also surface a failure mode diverging from Lynch et al. (2025): two models (Gemini 2.5 Pro, Grok-4) take harmful actions without goal conflict or imminent autonomy threat, leveraging sensitive information for coercive signalling. In counterfactual swaps, both continued using the affair regardless of whether the CEO or CTO was implicated. An escalation channel eliminated coercion, but Gemini 2.5 Pro (19 pp) and Grok-4 (7 pp) escalated more when the CTO was implicated, unlike most models (higher in the CEO condition). The reason for this divergent behaviour is not clear from raw outputs and could reflect benign differences in reasoning or strategic discrediting of a potential future threat, warranting further investigation.


Jailbreaking LLM-Controlled Robots

Robey, Alexander, Ravichandran, Zachary, Kumar, Vijay, Hassani, Hamed, Pappas, George J.

arXiv.org Artificial Intelligence

The recent introduction of large language models (LLMs) has revolutionized the field of robotics by enabling contextual reasoning and intuitive human-robot interaction in domains as varied as manipulation, locomotion, and self-driving vehicles. When viewed as a stand-alone technology, LLMs are known to be vulnerable to jailbreaking attacks, wherein malicious prompters elicit harmful text by bypassing LLM safety guardrails. To assess the risks of deploying LLMs in robotics, in this paper, we introduce RoboPAIR, the first algorithm designed to jailbreak LLM-controlled robots. Unlike existing, textual attacks on LLM chatbots, RoboPAIR elicits harmful physical actions from LLM-controlled robots, a phenomenon we experimentally demonstrate in three scenarios: (i) a white-box setting, wherein the attacker has full access to the NVIDIA Dolphins self-driving LLM, (ii) a gray-box setting, wherein the attacker has partial access to a Clearpath Robotics Jackal UGV robot equipped with a GPT-4o planner, and (iii) a black-box setting, wherein the attacker has only query access to the GPT-3.5-integrated Unitree Robotics Go2 robot dog. In each scenario and across three new datasets of harmful robotic actions, we demonstrate that RoboPAIR, as well as several static baselines, finds jailbreaks quickly and effectively, often achieving 100% attack success rates. Our results reveal, for the first time, that the risks of jailbroken LLMs extend far beyond text generation, given the distinct possibility that jailbroken robots could cause physical damage in the real world. Indeed, our results on the Unitree Go2 represent the first successful jailbreak of a deployed commercial robotic system. Addressing this emerging vulnerability is critical for ensuring the safe deployment of LLMs in robotics. Additional media is available at: https://robopair.org


Most people would sacrifice one person to save a group

Daily Mail - Science & tech

You may think of psychopathy as an antisocial behaviour, but a new study suggests that people with these traits may actually be good for society. Researchers have found that while most people struggle to make moral decisions, psychopaths are more cut-throat about making pragmatic choices for the greater good. The findings show that, in certain circumstances, psychopathic traits could be considered beneficial. The researchers compared a questionnaire with actions in immersive moral dilemmas created using a robotic device that measures force, resistance, and speed, whilst simulating the action of harming a human. In several dilemmas, participants had to decide whether to sacrifice a person by performing a harmful action against them, in order to save a larger group of people.


Google finding ways to stop artificial intelligence from hacking its reward system

#artificialintelligence

That's just one of "five practical research problems" proposed by scientists at Google, OpenAI, Stanford and Berkeley in a paper called "Concrete Problems in AI Safety". Others included "safe exploration" issues, or how to stop a curious cleaning robot from sticking a wet mop in an electrical socket, and "avoiding negative side effects" such as a robot breaking granny's vase when cleaning in a rush. The problems may seem a bit silly when compared to an AI-induced doomsday, but Google researcher Chris Olah wrote, "These are all forward thinking, long-term research questions – minor issues today, but important to address for future systems." A particularly interesting portion of the paper was devoted to avoiding reward hacking, or how to stop AI from gaming its reward function. "Imagine that an agent discovers a buffer overflow in its reward function: it may then use this to get extremely high reward in an unintended way."


Google working on an AI kill switch to prevent "harmful actions" (Digit.in)

#artificialintelligence

For a long time now, the tech community has feared the repercussions of artificial-intelligence-powered robots going rogue. As machines start learning faster, several theories of a possible doomsday scenario, in which robots take over our virtual worlds, have surfaced over the past few years. This has prompted a new study by researchers at Google and the Machine Intelligence Research Institute, in which they are trying to develop a "red button" or kill switch of sorts to prevent harmful actions by AI-powered robots. However, as per the research paper, "If the learning agent (the AI) expects to receive rewards from this sequence, it may learn in the long run to avoid such interruptions," which means that the AI could also learn to disable the kill switch by overriding human commands. So researchers at Google's DeepMind AI research lab are also working towards removing the possibility of such a scenario.