Tampering problems, where an AI agent interferes with whatever represents or communicates its intended objective and pursues the resulting corrupted objective instead, are a staple concern in the AGI safety literature [Amodei et al., 2016, Bostrom, 2014, Everitt and Hutter, 2016, Everitt et al., 2017, Armstrong and O'Rourke, 2017, Everitt and Hutter, 2019, Armstrong et al., 2020]. Variations on the idea of tampering include wireheading, where an agent learns how to stimulate its reward mechanism directly, and the off-switch or shutdown problem, where an agent interferes with its supervisor's ability to halt the agent's operation. Many real-world concerns can be formulated as tampering problems, as we will show (§2.1, §4.1). However, what constitutes tampering can be tricky to define precisely, despite clear intuitions in specific cases. We have developed a platform, REALab, to model tampering problems.
This paper presents a novel approach to the technical analysis of wireheading in intelligent agents. Inspired by the natural analogues of wireheading and their prevalent manifestations, we propose the modeling of such phenomenon in Reinforcement Learning (RL) agents as psychological disorders. In a preliminary step towards evaluating this proposal, we study the feasibility and dynamics of emergent addictive policies in Q-learning agents in the tractable environment of the game of Snake. We consider a slightly modified settings for this game, in which the environment provides a "drug" seed alongside the original "healthy" seed for the consumption of the snake. We adopt and extend an RL-based model of natural addiction to Q-learning agents in this settings, and derive sufficient parametric conditions for the emergence of addictive behaviors in such agents. Furthermore, we evaluate our theoretical analysis with three sets of simulation-based experiments. The results demonstrate the feasibility of addictive wireheading in RL agents, and provide promising venues of further research on the psychopathological modeling of complex AI safety problems.
One of the primary goals of AI research is the development of artificial agents that can exceed human performance on a wide range of cognitive tasks, in other words, artificial general intelligence (AGI). Although the development of AGI has many potential benefits, there are also many safety concerns that have been raised in the literature [Bostrom, 2014; Everitt et al., 2018; Amodei et al., 2016]. Various approaches for addressing AGI safety have been proposed [Leike et al., 2018; Christiano et al., 2018; Irving et al., 2018; Hadfield-Menell et al., 2016; Everitt, 2018], often presented as a modification of the reinforcement learning (RL) framework, or a new framework altogether. Understanding and comparing different frameworks for AGI safety can be difficult because they build on differing concepts and assumptions. For example, both reward modeling [Leike et al., 2018] and cooperative inverse RL [Hadfield-Menell et al., 2016] are frameworks for making an agent learn the preferences of a human user, but what are the key differences between them?
Can an arbitrarily intelligent reinforcement learning agent be kept under control by a human user? Or do agents with sufficient intelligence inevitably find ways to shortcut their reward signal? This question impacts how far reinforcement learning can be scaled, and whether alternative paradigms must be developed in order to build safe artificial general intelligence. In this paper, we use an intuitive yet precise graphical model called causal influence diagrams to formalize reward tampering problems. We also describe a number of modifications to the reinforcement learning objective that prevent incentives for reward tampering. We verify the solutions using recently developed graphical criteria for inferring agent incentives from causal influence diagrams. Along the way, we also compare corrigibility and self-preservation properties of the various solutions, and discuss how they can be combined into a single agent without reward tampering incentives.