Legg, Shane
Avoiding Tampering Incentives in Deep RL via Decoupled Approval
Uesato, Jonathan, Kumar, Ramana, Krakovna, Victoria, Everitt, Tom, Ngo, Richard, Legg, Shane
If reinforcement learning (RL) agents are to have a large influence in society, it is essential that we have reliable mechanisms to communicate our preferences to these systems. In the standard RL paradigm, the role of communicating our preferences is played by the reward function. However, it may not be possible to restrict sufficiently general RL agents from modifying physical implementations of their reward function, or more generally tampering with whatever process produces inputs to the learning algorithm, instead of pursuing the intended goal. Our central concern is the tampering problem, which can be summarized as: How can we design agents that pursue a given objective when all feedback mechanisms for describing that objective are influenceable by the agent? As a simplified example, consider designing an automated personal assistant with the objective of being useful for its user.
Algorithms for Causal Reasoning in Probability Trees
Genewein, Tim, McGrath, Tom, Delétang, Grégoire, Mikulik, Vladimir, Martic, Miljan, Legg, Shane, Ortega, Pedro A.
Probability trees are one of the simplest models of causal generative processes. They possess clean semantics and -- unlike causal Bayesian networks -- they can represent context-specific causal dependencies, which are necessary for e.g. causal induction. Yet, they have received little attention from the AI and ML community. Here we present concrete algorithms for causal reasoning in discrete probability trees that cover the entire causal hierarchy (association, intervention, and counterfactuals), and operate on arbitrary propositional and causal events. Our work expands the domain of causal reasoning to a very general class of discrete stochastic processes.
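A minimal sketch of the kind of reasoning the paper covers, assuming a simple tree representation in which each internal node resolves one variable and each branch carries a transition probability; the Node/prob/intervene names are illustrative, not the paper's API:

```python
from dataclasses import dataclass, field
from typing import Dict, Optional, Tuple


@dataclass
class Node:
    var: Optional[str] = None  # variable resolved at this node (None for a leaf)
    children: Dict[str, Tuple[float, "Node"]] = field(default_factory=dict)


def prob(node, event):
    """P(event), where `event` maps variables to required values and every
    event variable is assumed to be resolved on every root-to-leaf path."""
    if node.var is None:
        return 1.0  # leaf: the path so far is consistent with the event
    total = 0.0
    for value, (p, child) in node.children.items():
        if node.var in event and event[node.var] != value:
            continue  # branch contradicts the event
        total += p * prob(child, event)
    return total


def intervene(node, var, value):
    """do(var=value): wherever the tree branches on `var`, route all
    probability mass to the chosen value."""
    if node.var is None:
        return node
    new_children = {}
    for v, (p, child) in node.children.items():
        new_p = (1.0 if v == value else 0.0) if node.var == var else p
        new_children[v] = (new_p, intervene(child, var, value))
    return Node(node.var, new_children)


# Tiny example: weather W influences whether the sprinkler S is on.
tree = Node("W", {
    "rain": (0.4, Node("S", {"on": (0.1, Node()), "off": (0.9, Node())})),
    "sun":  (0.6, Node("S", {"on": (0.7, Node()), "off": (0.3, Node())})),
})
print(prob(tree, {"S": "on"}))                          # association: 0.46
print(prob(intervene(tree, "S", "on"), {"W": "rain"}))  # intervention leaves W at 0.4
```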
Meta-trained agents implement Bayes-optimal agents
Mikulik, Vladimir, Delétang, Grégoire, McGrath, Tom, Genewein, Tim, Martic, Miljan, Legg, Shane, Ortega, Pedro A.
Memory-based meta-learning is a powerful technique to build agents that adapt fast to any task within a target distribution. A previous theoretical study has argued that this remarkable performance is because the meta-training protocol incentivises agents to behave Bayes-optimally. We empirically investigate this claim on a number of prediction and bandit tasks. Inspired by ideas from theoretical computer science, we show that meta-learned and Bayes-optimal agents not only behave alike, but they even share a similar computational structure, in the sense that one agent system can approximately simulate the other. Furthermore, we show that Bayes-optimal agents are fixed points of the meta-learning dynamics. Our results suggest that memory-based meta-learning might serve as a general technique for numerically approximating Bayes-optimal agents - that is, even for task distributions for which we currently don't possess tractable models.
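A small sketch of the behavioural comparison in a Bernoulli-prediction setting, assuming a uniform prior over the coin bias so that the Bayes-optimal predictor is the Laplace rule; `behavioural_gap` and the naive baseline are illustrative stand-ins for probing a meta-trained predictor:

```python
import numpy as np

rng = np.random.default_rng(0)


def bayes_optimal_predict(history):
    """P(next=1 | history) under a Beta(1,1) prior: the Laplace rule."""
    return (sum(history) + 1) / (len(history) + 2)


def kl_bernoulli(p, q):
    return p * np.log(p / q) + (1 - p) * np.log((1 - p) / (1 - q))


def behavioural_gap(predict_fn, n_tasks=1000, horizon=20):
    """Average per-step KL between Bayes-optimal predictions and `predict_fn`."""
    gaps = []
    for _ in range(n_tasks):
        bias = rng.uniform()          # task parameter drawn from the prior
        history = []
        for _ in range(horizon):
            p_star = bayes_optimal_predict(history)
            p_hat = predict_fn(history)
            gaps.append(kl_bernoulli(p_star, p_hat))
            history.append(int(rng.uniform() < bias))
    return float(np.mean(gaps))


# A well meta-trained predictor should drive this gap towards zero; a naive
# frequency estimator is probed here just for contrast.
naive = lambda h: min(max(np.mean(h) if h else 0.5, 1e-3), 1 - 1e-3)
print(behavioural_gap(naive))
```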
Avoiding Side Effects By Considering Future Tasks
Krakovna, Victoria, Orseau, Laurent, Ngo, Richard, Martic, Miljan, Legg, Shane
Designing reward functions is difficult: the designer has to specify what to do (what it means to complete the task) as well as what not to do (side effects that should be avoided while completing the task). To alleviate the burden on the reward designer, we propose an algorithm to automatically generate an auxiliary reward function that penalizes side effects. This auxiliary objective rewards the ability to complete possible future tasks, which decreases if the agent causes side effects during the current task. The future task reward can also give the agent an incentive to interfere with events in the environment that make future tasks less achievable, such as irreversible actions by other agents. To avoid this interference incentive, we introduce a baseline policy that represents a default course of action (such as doing nothing), and use it to filter out future tasks that are not achievable by default. We formally define interference incentives and show that the future task approach with a baseline policy avoids these incentives in the deterministic case. Using gridworld environments that test for side effects and interference, we show that our method avoids interference and is more effective for avoiding side effects than the common approach of penalizing irreversible actions.
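A hedged sketch of the auxiliary-reward idea (not the paper's exact formulation), assuming value estimates V_i(s) for candidate future tasks are available; tasks that the baseline course of action cannot achieve are filtered out, and the agent is not credited for exceeding the baseline's achievability:

```python
def future_task_aux_reward(state, baseline_state, future_task_values, eps=1e-6):
    """Average achievability of candidate future tasks, restricted to tasks
    that the baseline ("default") course of action keeps achievable.

    future_task_values: list of functions V_i mapping a state to the value of
    completing future task i from that state (assumed to lie in [0, 1]).
    """
    achievable = [V for V in future_task_values if V(baseline_state) > eps]
    if not achievable:  # nothing achievable by default: no signal, no interference incentive
        return 0.0
    # Capping at the baseline value means the agent gets no credit for making
    # future tasks *more* achievable than they would be by default.
    return sum(min(V(state), V(baseline_state)) for V in achievable) / len(achievable)


# Toy usage with two candidate future tasks over an integer position.
V_left = lambda s: 1.0 if s <= 0 else 0.0   # "be at position <= 0"
V_right = lambda s: 1.0 if s >= 5 else 0.0  # "be at position >= 5"
print(future_task_aux_reward(0, 0, [V_left, V_right]))  # 1.0: the default keeps the left task achievable
print(future_task_aux_reward(2, 0, [V_left, V_right]))  # 0.0: moving away forfeits it
```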
Quantifying Differences in Reward Functions
Gleave, Adam, Dennis, Michael, Legg, Shane, Russell, Stuart, Leike, Jan
For many tasks, the reward function is too complex to be specified procedurally, and must instead be learned from user data. Prior work has evaluated learned reward functions by examining rollouts from a policy optimized for the learned reward. However, this method cannot distinguish between the learned reward function failing to reflect user preferences, and the reinforcement learning algorithm failing to optimize the learned reward. Moreover, the rollout method is highly sensitive to details of the environment the learned reward is evaluated in, which often differ in the deployment environment. To address these problems, we introduce the Equivalent-Policy Invariant Comparison (EPIC) distance to quantify the difference between two reward functions directly, without training a policy. We prove EPIC is invariant on an equivalence class of reward functions that always induce the same optimal policy. Furthermore, we find EPIC can be precisely approximated and is more robust than baselines to the choice of visitation distribution. Finally, we find that the EPIC distance of learned reward functions to the ground-truth reward is predictive of the success of training a policy, even in different transition dynamics.
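A tabular sketch of an EPIC-style comparison: rewards R(s, a, s') are given as arrays, canonical shaping removes potential-based terms, and the distance is the Pearson distance between the canonicalized rewards under a uniform coverage distribution. The distributions and the sanity check below are illustrative, not the paper's reference implementation:

```python
import numpy as np

rng = np.random.default_rng(0)


def canonicalize(R, gamma, state_dist, action_dist):
    """Canonically shape R(s,a,s') so potential-based shaping cancels out:
    C(R)(s,a,s') = R(s,a,s') + E[gamma*R(s',A,S') - R(s,A,S') - gamma*R(S,A,S')],
    with A ~ action_dist and S, S' ~ state_dist drawn independently."""
    m = np.einsum('a,xay,y->x', action_dist, R, state_dist)  # E_{A,S'} R(x, A, S')
    mean_all = float(state_dist @ m)                         # E_{S,A,S'} R(S, A, S')
    C = R.astype(float).copy()
    C += gamma * m[None, None, :]   # + gamma * E[R(s', A, S')]
    C -= m[:, None, None]           # - E[R(s, A, S')]
    C -= gamma * mean_all           # - gamma * E[R(S, A, S')]
    return C


def epic_distance(RA, RB, gamma, state_dist, action_dist):
    """Pearson distance between the canonically shaped rewards, here under a
    uniform coverage distribution over (s, a, s') triples."""
    x = canonicalize(RA, gamma, state_dist, action_dist).ravel()
    y = canonicalize(RB, gamma, state_dist, action_dist).ravel()
    rho = np.corrcoef(x, y)[0, 1]
    return np.sqrt(max(0.0, (1.0 - rho) / 2.0))  # clamp guards tiny negative fp error


# Sanity check: a scaled, potential-shaped copy of a reward should be at distance ~0.
nS, nA, gamma = 4, 3, 0.9
R = rng.normal(size=(nS, nA, nS))
phi = rng.normal(size=nS)  # arbitrary potential function
R_shaped = 2.0 * (R + gamma * phi[None, None, :] - phi[:, None, None])
uS, uA = np.full(nS, 1 / nS), np.full(nA, 1 / nA)
print(epic_distance(R, R_shaped, gamma, uS, uA))
print(epic_distance(R, rng.normal(size=(nS, nA, nS)), gamma, uS, uA))  # unrelated reward: far from 0
```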
Modeling AGI Safety Frameworks with Causal Influence Diagrams
Everitt, Tom, Kumar, Ramana, Krakovna, Victoria, Legg, Shane
One of the primary goals of AI research is the development of artificial agents that can exceed human performance on a wide range of cognitive tasks, in other words, artificial general intelligence (AGI). Although the development of AGI has many potential benefits, there are also many safety concerns that have been raised in the literature [Bostrom, 2014; Everitt et al., 2018; Amodei et al., 2016]. Various approaches for addressing AGI safety have been proposed [Leike et al., 2018; Christiano et al., 2018; Irving et al., 2018; Hadfield-Menell et al., 2016; Everitt, 2018], often presented as a modification of the reinforcement learning (RL) framework, or a new framework altogether. Understanding and comparing different frameworks for AGI safety can be difficult because they build on differing concepts and assumptions. For example, both reward modeling [Leike et al., 2018] and cooperative inverse RL [Hadfield-Menell et al., 2016] are frameworks for making an agent learn the preferences of a human user, but what are the key differences between them?
Meta-learning of Sequential Strategies
Ortega, Pedro A., Wang, Jane X., Rowland, Mark, Genewein, Tim, Kurth-Nelson, Zeb, Pascanu, Razvan, Heess, Nicolas, Veness, Joel, Pritzel, Alex, Sprechmann, Pablo, Jayakumar, Siddhant M., McGrath, Tom, Miller, Kevin, Azar, Mohammad, Osband, Ian, Rabinowitz, Neil, György, András, Chiappa, Silvia, Osindero, Simon, Teh, Yee Whye, van Hasselt, Hado, de Freitas, Nando, Botvinick, Matthew, Legg, Shane
In this report we review memory-based meta-learning as a tool for building sample-efficient strategies that learn from past experience to adapt to any task within a target class. Our goal is to equip the reader with the conceptual foundations of this tool for building new, scalable agents that operate on broad domains. To do so, we present basic algorithmic templates for building near-optimal predictors and reinforcement learners which behave as if they had a probabilistic model that allowed them to efficiently exploit task structure. Furthermore, we recast memory-based meta-learning within a Bayesian framework, showing that the meta-learned strategies are near-optimal because they amortize Bayes-filtered data, where the adaptation is implemented in the memory dynamics as a state-machine of sufficient statistics. Essentially, memory-based meta-learning translates the hard problem of probabilistic sequential inference into a regression problem.
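A minimal instance of the memory-based meta-learning template, assuming a Bernoulli prediction task class (coin bias drawn from a uniform prior) and using PyTorch purely for illustration; after meta-training on the log-loss across sampled tasks, the recurrent predictor should approximate the Bayes-optimal (Laplace-rule) posterior predictive:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)


class RecurrentPredictor(nn.Module):
    def __init__(self, hidden=32):
        super().__init__()
        self.rnn = nn.LSTM(input_size=1, hidden_size=hidden, batch_first=True)
        self.head = nn.Linear(hidden, 1)

    def forward(self, x):                    # x: [batch, time, 1] of past bits
        h, _ = self.rnn(x)
        return torch.sigmoid(self.head(h))   # P(next bit = 1) at every step


def sample_tasks(batch, horizon):
    bias = torch.rand(batch, 1)                         # task parameter ~ Uniform(0, 1)
    return (torch.rand(batch, horizon) < bias).float()  # observations for each task


model = RecurrentPredictor()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
batch, horizon = 64, 20

for step in range(2000):                     # meta-training loop over sampled tasks
    bits = sample_tasks(batch, horizon)
    # input at time t is the bit observed at t-1 (zero for the first step)
    inputs = torch.cat([torch.zeros(batch, 1), bits[:, :-1]], dim=1).unsqueeze(-1)
    preds = model(inputs).squeeze(-1)
    loss = nn.functional.binary_cross_entropy(preds, bits)  # log-loss
    opt.zero_grad(); loss.backward(); opt.step()

# After meta-training, compare with the Laplace rule on a run of 19 observed ones.
with torch.no_grad():
    test = torch.ones(1, horizon)
    inp = torch.cat([torch.zeros(1, 1), test[:, :-1]], dim=1).unsqueeze(-1)
    print(model(inp).squeeze()[-1].item(), "vs Laplace rule", 20 / 21)
```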
Understanding Agent Incentives using Causal Influence Diagrams. Part I: Single Action Settings
Everitt, Tom, Ortega, Pedro A., Barnes, Elizabeth, Legg, Shane
Agents are systems that optimize an objective function in an environment. Together, the objective and the environment induce secondary objectives: incentives. Modeling the agent-environment interaction in graphical models called influence diagrams, we can answer two fundamental questions about an agent's incentives directly from the graph: (1) which nodes is the agent incentivized to observe, and (2) which nodes is the agent incentivized to influence? The answers tell us which information and influence points need extra protection. For example, we may want a classifier for job applications to not use the ethnicity of the candidate, and a reinforcement learning agent not to take direct control of its reward mechanism. Different algorithms and training paradigms can lead to different influence diagrams, so our method can be used to identify algorithms with problematic incentives and help in designing algorithms with better incentives.
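A deliberately simplified graphical check, assuming a single-decision diagram and using the criterion that a node admits an influence incentive when it lies on a directed path from the decision to the utility; this is a coarse stand-in for the paper's graphical criteria, meant only to illustrate reading incentives off the graph:

```python
import networkx as nx


def influence_incentive_nodes(graph, decision, utility):
    """Nodes X (other than the decision and utility) with a directed path
    decision -> ... -> X -> ... -> utility."""
    downstream_of_decision = nx.descendants(graph, decision)
    upstream_of_utility = nx.ancestors(graph, utility)
    return (downstream_of_decision & upstream_of_utility) - {decision, utility}


# Toy diagram: the agent's action A affects its reward mechanism M and the
# task outcome T, and both feed into the utility U.
cid = nx.DiGraph([("A", "M"), ("A", "T"), ("M", "U"), ("T", "U")])
print(influence_incentive_nodes(cid, decision="A", utility="U"))  # {'M', 'T'}
```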
Soft-Bayes: Prod for Mixtures of Experts with Log-Loss
Orseau, Laurent, Lattimore, Tor, Legg, Shane
We consider prediction with expert advice under the log-loss with the goal of deriving efficient and robust algorithms. We argue that existing algorithms such as exponentiated gradient, online gradient descent and online Newton step do not adequately satisfy both requirements. Our main contribution is an analysis of the Prod algorithm that is robust to any data sequence and runs in linear time relative to the number of experts in each round. Despite the unbounded nature of the log-loss, we derive a bound that is independent of the largest loss and of the largest gradient, and depends only on the number of experts and the time horizon. Furthermore we give a Bayesian interpretation of Prod and adapt the algorithm to derive a tracking regret.
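A hedged sketch of a Prod-style (Soft-Bayes-like) update under the log-loss, where each expert reports the probability it assigned to the observed outcome and the learner reweights the experts multiplicatively with learning rate eta; the learning-rate choice and the toy task are illustrative, not the paper's exact setup:

```python
import numpy as np

rng = np.random.default_rng(0)


def soft_bayes(expert_probs, eta):
    """expert_probs: array [T, N], entry (t, i) = expert i's predicted
    probability of the outcome actually observed at round t.
    Returns the learner's per-round mixture probabilities (its log-loss is
    the negative log of these)."""
    T, N = expert_probs.shape
    w = np.full(N, 1.0 / N)                      # uniform prior over experts
    mix = np.empty(T)
    for t in range(T):
        p = expert_probs[t]
        mix[t] = w @ p                           # learner's probability of the outcome
        w = w * (1.0 - eta + eta * p / mix[t])   # multiplicative update; weights stay on the simplex
    return mix


# Toy run: three experts predicting a biased coin; expert 0 knows the true bias.
T, bias = 500, 0.8
outcomes = (rng.uniform(size=T) < bias).astype(int)
expert_heads = np.array([0.8, 0.5, 0.2])         # each expert's P(heads)
probs = np.where(outcomes[:, None] == 1, expert_heads, 1 - expert_heads)
mix = soft_bayes(probs, eta=np.sqrt(np.log(3) / T))
print("learner log-loss:", -np.log(mix).sum(),
      "best expert log-loss:", -np.log(probs).sum(axis=0).min())
```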
Reward learning from human preferences and demonstrations in Atari
Ibarz, Borja, Leike, Jan, Pohlen, Tobias, Irving, Geoffrey, Legg, Shane, Amodei, Dario
To solve complex real-world problems with reinforcement learning, we cannot rely on manually specified reward functions. Instead, we can have humans communicate an objective to the agent directly. In this work, we combine two approaches to learning from human feedback: expert demonstrations and trajectory preferences. We train a deep neural network to model the reward function and use its predicted reward to train a DQN-based deep reinforcement learning agent on 9 Atari games. Our approach beats the imitation learning baseline in 7 games and achieves strictly superhuman performance on 2 games without using game rewards. Additionally, we investigate the goodness of fit of the reward model, present some reward hacking problems, and study the effects of noise in the human labels.
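A minimal sketch of the preference-based reward-modelling component, assuming trajectory segments arrive as fixed-size feature tensors and human labels follow a Bradley-Terry model; the demonstration (DQfD) side of the method is omitted and the encoder is illustrative:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)


class RewardModel(nn.Module):
    def __init__(self, obs_dim, hidden=64):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(obs_dim, hidden), nn.ReLU(),
                                 nn.Linear(hidden, 1))

    def forward(self, segment):                    # segment: [batch, steps, obs_dim]
        return self.net(segment).sum(dim=(1, 2))   # predicted return of each segment


def preference_loss(model, seg_a, seg_b, prefs):
    """Bradley-Terry: P(A preferred over B) = sigmoid(R_A - R_B).
    prefs[i] = 1.0 if the human preferred segment A in pair i, else 0.0."""
    logits = model(seg_a) - model(seg_b)
    return nn.functional.binary_cross_entropy_with_logits(logits, prefs)


# One toy training step on synthetic data (obs_dim=8, 25-step segments).
model = RewardModel(obs_dim=8)
opt = torch.optim.Adam(model.parameters(), lr=1e-4)
seg_a, seg_b = torch.randn(32, 25, 8), torch.randn(32, 25, 8)
prefs = torch.randint(0, 2, (32,)).float()
loss = preference_loss(model, seg_a, seg_b, prefs)
opt.zero_grad(); loss.backward(); opt.step()
print(float(loss))
```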