Everitt, Tom
Measuring Goal-Directedness
MacDermott, Matt, Fox, James, Belardinelli, Francesco, Everitt, Tom
We define maximum entropy goal-directedness (MEG), a formal measure of goal-directedness in causal models and Markov decision processes, and give algorithms for computing it. Measuring goal-directedness is important, as it is a critical element of many concerns about harm from AI. It is also of philosophical interest, as goal-directedness is a key aspect of agency. MEG is based on an adaptation of the maximum causal entropy framework used in inverse reinforcement learning. It can measure goal-directedness with respect to a known utility function, a hypothesis class of utility functions, or a set of random variables. We prove that MEG satisfies several desiderata and demonstrate our algorithms with small-scale experiments.
A Mechanism-Based Approach to Mitigating Harms from Persuasive Generative AI
El-Sayed, Seliem, Akbulut, Canfer, McCroskery, Amanda, Keeling, Geoff, Kenton, Zachary, Jalan, Zaria, Marchal, Nahema, Manzini, Arianna, Shevlane, Toby, Vallor, Shannon, Susser, Daniel, Franklin, Matija, Bridgers, Sophie, Law, Harry, Rahtz, Matthew, Shanahan, Murray, Tessler, Michael Henry, Douillard, Arthur, Everitt, Tom, Brown, Sasha
Recent generative AI systems have demonstrated more advanced persuasive capabilities and are increasingly permeating areas of life where they can influence decision-making. Generative AI presents a new risk profile of persuasion due the opportunity for reciprocal exchange and prolonged interactions. This has led to growing concerns about harms from AI persuasion and how they can be mitigated, highlighting the need for a systematic study of AI persuasion. The current definitions of AI persuasion are unclear and related harms are insufficiently studied. Existing harm mitigation approaches prioritise harms from the outcome of persuasion over harms from the process of persuasion. In this paper, we lay the groundwork for the systematic study of AI persuasion. We first put forward definitions of persuasive generative AI. We distinguish between rationally persuasive generative AI, which relies on providing relevant facts, sound reasoning, or other forms of trustworthy evidence, and manipulative generative AI, which relies on taking advantage of cognitive biases and heuristics or misrepresenting information. We also put forward a map of harms from AI persuasion, including definitions and examples of economic, physical, environmental, psychological, sociocultural, political, privacy, and autonomy harm. We then introduce a map of mechanisms that contribute to harmful persuasion. Lastly, we provide an overview of approaches that can be used to mitigate against process harms of persuasion, including prompt engineering for manipulation classification and red teaming. Future work will operationalise these mitigations and study the interaction between different types of mechanisms of persuasion.
Robust agents learn causal world models
Richens, Jonathan, Everitt, Tom
It has long been hypothesised that causal reasoning plays a fundamental role in robust and general intelligence. However, it is not known if agents must learn causal models in order to generalise to new domains, or if other inductive biases are sufficient. We answer this question, showing that any agent capable of satisfying a regret bound under a large set of distributional shifts must have learned an approximate causal model of the data generating process, which converges to the true causal model for optimal agents. We discuss the implications of this result for several research areas including transfer learning and causal inference.
The Reasons that Agents Act: Intention and Instrumental Goals
Ward, Francis Rhys, MacDermott, Matt, Belardinelli, Francesco, Toni, Francesca, Everitt, Tom
Intention is an important and challenging concept in AI. It is important because it underlies many other concepts we care about, such as agency, manipulation, legal responsibility, and blame. However, ascribing intent to AI systems is contentious, and there is no universally accepted theory of intention applicable to AI agents. We operationalise the intention with which an agent acts, relating to the reasons it chooses its decision. We introduce a formal definition of intention in structural causal influence models, grounded in the philosophy literature on intent and applicable to real-world machine learning systems. Through a number of examples and results, we show that our definition captures the intuitive notion of intent and satisfies desiderata set-out by past work. In addition, we show how our definition relates to past concepts, including actual causality, and the notion of instrumental goals, which is a core idea in the literature on safe AI agents. Finally, we demonstrate how our definition can be used to infer the intentions of reinforcement learning agents and language models from their behaviour.
Honesty Is the Best Policy: Defining and Mitigating AI Deception
Ward, Francis Rhys, Belardinelli, Francesco, Toni, Francesca, Everitt, Tom
Deceptive agents are a challenge for the safety, trustworthiness, and cooperation of AI systems. We focus on the problem that agents might deceive in order to achieve their goals (for instance, in our experiments with language models, the goal of being evaluated as truthful). There are a number of existing definitions of deception in the literature on game theory and symbolic AI, but there is no overarching theory of deception for learning agents in games. We introduce a formal definition of deception in structural causal games, grounded in the philosophy literature, and applicable to real-world machine learning systems. Several examples and results illustrate that our formal definition aligns with the philosophical and commonsense meaning of deception. Our main technical result is to provide graphical criteria for deception. We show, experimentally, that these results can be used to mitigate deception in reinforcement learning agents and language models.
Characterising Decision Theories with Mechanised Causal Graphs
MacDermott, Matt, Everitt, Tom, Belardinelli, Francesco
How should my own decisions affect my beliefs about the outcomes I expect to achieve? If taking a certain action makes me view myself as a certain type of person, it might affect how I think others view me, and how I view others who are similar to me. This can influence my expected utility calculations and change which action I perceive to be best. Whether and how it should is subject to debate, with contenders for how to think about it including evidential decision theory, causal decision theory, and functional decision theory. In this paper, we show that mechanised causal models can be used to characterise and differentiate the most important decision theories, and generate a taxonomy of different decision theories.
Human Control: Definitions and Algorithms
Carey, Ryan, Everitt, Tom
How can humans stay in control of advanced artificial intelligence systems? One proposal is corrigibility, which requires the agent to follow the instructions of a human overseer, without inappropriately influencing them. In this paper, we formally define a variant of corrigibility called shutdown instructability, and show that it implies appropriate shutdown behavior, retention of human autonomy, and avoidance of user harm. We also analyse the related concepts of non-obstruction and shutdown alignment, three previously proposed algorithms for human control, and one new algorithm.
Reasoning about Causality in Games
Hammond, Lewis, Fox, James, Everitt, Tom, Carey, Ryan, Abate, Alessandro, Wooldridge, Michael
Causal reasoning and game-theoretic reasoning are fundamental topics in artificial intelligence, among many other disciplines: this paper is concerned with their intersection. Despite their importance, a formal framework that supports both these forms of reasoning has, until now, been lacking. We offer a solution in the form of (structural) causal games, which can be seen as extending Pearl's causal hierarchy to the game-theoretic domain, or as extending Koller and Milch's multi-agent influence diagrams to the causal domain. We then consider three key questions: i) How can the (causal) dependencies in games - either between variables, or between strategies - be modelled in a uniform, principled manner? ii) How may causal queries be computed in causal games, and what assumptions does this require? iii) How do causal games compare to existing formalisms? To address question i), we introduce mechanised games, which encode dependencies between agents' decision rules and the distributions governing the game. In response to question ii), we present definitions of predictions, interventions, and counterfactuals, and discuss the assumptions required for each. Regarding question iii), we describe correspondences between causal games and other formalisms, and explain how causal games can be used to answer queries that other causal or game-theoretic models do not support. Finally, we highlight possible applications of causal games, aided by an extensive open-source Python library.
Shaking the foundations: delusions in sequence models for interaction and control
Ortega, Pedro A., Kunesch, Markus, Delétang, Grégoire, Genewein, Tim, Grau-Moya, Jordi, Veness, Joel, Buchli, Jonas, Degrave, Jonas, Piot, Bilal, Perolat, Julien, Everitt, Tom, Tallec, Corentin, Parisotto, Emilio, Erez, Tom, Chen, Yutian, Reed, Scott, Hutter, Marcus, de Freitas, Nando, Legg, Shane
The recent phenomenal success of language models has reinvigorated machine learning research, and large sequence models such as transformers are being applied to a variety of domains. One important problem class that has remained relatively elusive however is purposeful adaptive behavior. Currently there is a common perception that sequence models "lack the understanding of the cause and effect of their actions" leading them to draw incorrect inferences due to auto-suggestive delusions. In this report we explain where this mismatch originates, and show that it can be resolved by treating actions as causal interventions. Finally, we show that in supervised learning, one can teach a system to condition or intervene on data by training with factual and counterfactual error signals respectively.
Alignment of Language Agents
Kenton, Zachary, Everitt, Tom, Weidinger, Laura, Gabriel, Iason, Mikulik, Vladimir, Irving, Geoffrey
For artificial intelligence to be beneficial to humans the behaviour of AI agents needs to be aligned with what humans want. In this paper we discuss some behavioural issues for language agents, arising from accidental misspecification by the system designer. We highlight some ways that misspecification can occur and discuss some behavioural issues that could arise from misspecification, including deceptive or manipulative language, and review some approaches for avoiding these issues.