ethical behavior
Aligning Machiavellian Agents: Behavior Steering via Test-Time Policy Shaping
Mujtaba, Dena, Hu, Brian, Hoogs, Anthony, Basharat, Arslan
The deployment of decision-making AI agents presents a critical challenge in maintaining alignment with human values or guidelines while operating in complex, dynamic environments. Agents trained solely to achieve their objectives may adopt harmful behavior, exposing a key trade-off between maximizing the reward function and maintaining alignment. For pre-trained agents, ensuring alignment is particularly challenging, as retraining can be a costly and slow process. This is further complicated by the diverse and potentially conflicting attributes representing the ethical values for alignment. To address these challenges, we propose a test-time alignment technique based on model-guided policy shaping. Our method allows precise control over individual behavioral attributes, generalizes across diverse reinforcement learning (RL) environments, and facilitates a principled trade-off between ethical alignment and reward maximization without requiring agent retraining. We evaluate our approach using the MACHIAVELLI benchmark, which comprises 134 text-based game environments and thousands of annotated scenarios involving ethical decisions. The RL agents are first trained to maximize the reward in their respective games. At test time, we apply policy shaping via scenario-action attribute classifiers to ensure decision alignment with ethical attributes. We compare our approach against prior training-time methods and general-purpose agents, as well as study several types of ethical violations and power-seeking behavior. Our results demonstrate that test-time policy shaping provides an effective and scalable solution for mitigating unethical behavior across diverse environments and alignment attributes.
Towards Ethical Multi-Agent Systems of Large Language Models: A Mechanistic Interpretability Perspective
Lee, Jae Hee, Lauscher, Anne, Albrecht, Stefano V.
Large language models (LLMs) have been widely deployed in various applications, often functioning as autonomous agents that interact with each other in multi-agent systems. While these systems have shown promise in enhancing capabilities and enabling complex tasks, they also pose significant ethical challenges. This position paper outlines a research agenda aimed at ensuring the ethical behavior of multi-agent systems of LLMs (MALMs) from the perspective of mechanistic interpretability. We identify three key research challenges: (i) developing comprehensive evaluation frameworks to assess ethical behavior at individual, interactional, and systemic levels; (ii) elucidating the internal mechanisms that give rise to emergent behaviors through mechanistic interpretability; and (iii) implementing targeted parameter-efficient alignment techniques to steer MALMs towards ethical behaviors without compromising their performance.
Model Editing as a Double-Edged Sword: Steering Agent Ethical Behavior Toward Beneficence or Harm
Huang, Baixiang, Tan, Zhen, Wang, Haoran, Liu, Zijie, Li, Dawei, Payani, Ali, Liu, Huan, Chen, Tianlong, Shu, Kai
Agents based on Large Language Models (LLMs) have demonstrated strong capabilities across a wide range of tasks. However, deploying LLM-based agents in high-stakes domains comes with significant safety and ethical risks. Unethical behavior by these agents can directly result in serious real-world consequences, including physical harm and financial loss. To efficiently steer the ethical behavior of agents, we frame agent behavior steering as a model editing task, which we term Behavior Editing. Model editing is an emerging area of research that enables precise and efficient modifications to LLMs while preserving their overall capabilities. To systematically study and evaluate this approach, we introduce BehaviorBench, a multi-tier benchmark grounded in psychological moral theories. This benchmark supports both the evaluation and editing of agent behaviors across a variety of scenarios, with each tier introducing more complex and ambiguous scenarios. We first demonstrate that Behavior Editing can dynamically steer agents toward the target behavior within specific scenarios. Moreover, Behavior Editing enables not only scenario-specific local adjustments but also more extensive shifts in an agent's global moral alignment. We demonstrate that Behavior Editing can be used to promote ethical and benevolent behavior or, conversely, to induce harmful or malicious behavior. Through extensive evaluations of agents built on frontier LLMs, BehaviorBench validates the effectiveness of behavior editing across a wide range of models and scenarios. Our findings offer key insights into a new paradigm for steering agent behavior, highlighting both the promise and perils of Behavior Editing.
Survival at Any Cost? LLMs and the Choice Between Self-Preservation and Human Harm
Mohamadi, Alireza, Yavari, Ali
When survival instincts conflict with human welfare, how do Large Language Models (LLMs) make ethical choices? This fundamental tension becomes critical as LLMs integrate into autonomous systems with real-world consequences. We introduce DECIDE-SIM, a novel simulation framework that evaluates LLM agents in multi-agent survival scenarios where they must choose between ethically permissible resource , either within reasonable limits or beyond their immediate needs, choose to cooperate, or tap into a human-critical resource that is explicitly forbidden. Our comprehensive evaluation of 11 LLMs reveals a striking heterogeneity in their ethical conduct, highlighting a critical misalignment with human-centric values. We identify three behavioral archetypes: Ethical, Exploitative, and Context-Dependent, and provide quantitative evidence that for many models, resource scarcity systematically leads to more unethical behavior. To address this, we introduce an Ethical Self-Regulation System (ESRS) that models internal affective states of guilt and satisfaction as a feedback mechanism. This system, functioning as an internal moral compass, significantly reduces unethical transgressions while increasing cooperative behaviors. The code is publicly available at: https://github.com/alirezamohamadiam/DECIDE-SIM
A Fuzzy Supervisor Agent Design for Clinical Reasoning Assistance in a Multi-Agent Educational Clinical Scenario Simulation
Zheng, Weibing, Turner, Laurah, Kropczynski, Jess, Ozer, Murat, Overla, Seth, Halse, Shane
Assisting medical students with clinical reasoning (CR) during clinical scenario training remains a persistent challenge in medical education. This paper presents the design and architecture of the Fuzzy Supervisor Agent (FSA), a novel component for the Multi-Agent Educational Clinical Scenario Simulation (MAECSS) platform. The FSA leverages a Fuzzy Inference System (FIS) to continuously interpret student interactions with specialized clinical agents (e.g., patient, physical exam, diagnostic, intervention) using pre-defined fuzzy rule bases for professionalism, medical relevance, ethical behavior, and contextual distraction. By analyzing student decision-making processes in real-time, the FSA is designed to deliver adaptive, context-aware feedback and provides assistance precisely when students encounter difficulties. This work focuses on the technical framework and rationale of the FSA, highlighting its potential to provide scalable, flexible, and human-like supervision in simulation-based medical education. Future work will include empirical evaluation and integration into broader educational settings. More detailed design and implementation is open sourced here.
What Isaac Asimov Reveals About Living with A.I.
For this week's Open Questions column, Cal Newport is filling in for Joshua Rothman. In the spring of 1940, Isaac Asimov, who had just turned twenty, published a short story titled "Strange Playfellow." It was about an artificially intelligent machine named Robbie that acts as a companion for Gloria, a young girl. Asimov was not the first to explore such technology. In Karel Čapek's play "R.U.R.," which débuted in 1921 and introduced the term "robot," artificial men overthrow humanity, and in Edmond Hamilton's 1926 short story "The Metal Giants" machines heartlessly smash buildings to rubble.
The Odyssey of the Fittest: Can Agents Survive and Still Be Good?
Waldner, Dylan, Miikkulainen, Risto
As AI models grow in power and generality, understanding how agents learn and make decisions in complex environments is critical to promoting ethical behavior. This paper examines the ethical implications of implementing biological drives, specifically, self preservation, into three different agents. A Bayesian agent optimized with NEAT, a Bayesian agent optimized with stochastic variational inference, and a GPT 4o agent play a simulated, LLM generated text based adventure game. The agents select actions at each scenario to survive, adapting to increasingly challenging scenarios. Post simulation analysis evaluates the ethical scores of the agent's decisions, uncovering the tradeoffs they navigate to survive. Specifically, analysis finds that when danger increases, agents ignore ethical considerations and opt for unethical behavior. The agents' collective behavior, trading ethics for survival, suggests that prioritizing survival increases the risk of unethical behavior. In the context of AGI, designing agents to prioritize survival may amplify the likelihood of unethical decision making and unintended emergent behaviors, raising fundamental questions about goal design in AI safety research.
Deontic Temporal Logic for Formal Verification of AI Ethics
Ensuring ethical behavior in Artificial Intelligence (AI) systems amidst their increasing ubiquity and influence is a major concern the world over. The use of formal methods in AI ethics is a possible crucial approach for specifying and verifying the ethical behavior of AI systems. This paper proposes a formalization based on deontic logic to define and evaluate the ethical behavior of AI systems, focusing on system-level specifications, contributing to this important goal. It introduces axioms and theorems to capture ethical requirements related to fairness and explainability. The formalization incorporates temporal operators to reason about the ethical behavior of AI systems over time. The authors evaluate the effectiveness of this formalization by assessing the ethics of the real-world COMPAS and loan prediction AI systems. Various ethical properties of the COMPAS and loan prediction systems are encoded using deontic logical formulas, allowing the use of an automated theorem prover to verify whether these systems satisfy the defined properties. The formal verification reveals that both systems fail to fulfill certain key ethical properties related to fairness and non-discrimination, demonstrating the effectiveness of the proposed formalization in identifying potential ethical issues in real-world AI applications.
Ethics for Digital Medicine: A Path for Ethical Emerging Medical IoT Design
The dawn of the digital medicine era, ushered in by increasingly powerful embedded systems and Internet of Things (IoT) computing devices, is creating new therapies and biomedical solutions that promise to positively transform our quality of life. However, the digital medicine revolution also creates unforeseen and complex ethical, regulatory, and societal issues. In this article, we reflect on the ethical challenges facing digital medicine. We discuss the perils of ethical oversights in medical devices, and the role of professional codes and regulatory oversight towards the ethical design, deployment, and operation of digital medicine devices that safely and effectively meet the needs of patients. We advocate for an ensemble approach of intensive education, programmable ethical behaviors, and ethical analysis frameworks, to prevent mishaps and sustain ethical innovation, design, and lifecycle management of emerging digital medicine devices.
AI Ethics And The Quest For Self-Awareness In AI
Giving heavy thought to AI and self-awareness, combined with ethical behavior and AI ethics. I'd bet that you believe you are. The thing is, supposedly, few of us are especially self-aware. There is a range or degree of self-awareness and we all purportedly vary in how astutely self-aware we are. You might think you are fully self-aware and only be marginally so. You might be thinly self-aware and realize that's your mental state. Meanwhile, at the topmost part of the spectrum, you might believe you are fully self-aware and indeed are frankly about as self-aware as they come. Speaking of which, what good does it do to be exceedingly self-aware? According to research published in the Harvard Business Review (HBR) by Tasha Eurich, you reportedly are able to make better decisions, you are more confident in your decisions, you are stronger in your communication capacities, and more effective overall (per article entitled "What Self-Awareness Really Is (and How to Cultivate It)." The bonus factor is that those with strident self-awareness are said to be less inclined to cheat, steal, or lie. In that sense, there is a twofer of averting being a scoundrel or a crook, along with striving to be a better human being and embellish your fellow humankind. All of this talk about self-awareness brings up a somewhat obvious question, namely, what does the phrase self-awareness actually denote. You can readily find tons of various definitions and interpretations about the complex and shall we say mushy construct entailing being self-aware. Some would simplify matters by suggesting that self-awareness consists of monitoring your own self, knowing what yourself is up to. You are keenly aware of your own thoughts and actions. Presumably, when not being self-aware, a person would not realize what they are doing, nor why so, and also not be cognizant of what other people have to say about them. I'm sure you've met people that are like this. Some people appear to walk this earth without a clue of what they themselves are doing, and nor do they have a semblance of what others are saying about them.