AI Alignment with Changing and Influenceable Reward Functions
Carroll, Micah, Foote, Davis, Siththaranjan, Anand, Russell, Stuart, Dragan, Anca
Existing AI alignment approaches assume that preferences are static, which is unrealistic: our preferences change, and may even be influenced by our interactions with AI systems themselves. To clarify the consequences of incorrectly assuming static preferences, we introduce Dynamic Reward Markov Decision Processes (DR-MDPs), which explicitly model preference changes and the AI's influence on them. We show that despite its convenience, the static-preference assumption may undermine the soundness of existing alignment techniques, leading them to implicitly reward AI systems for influencing user preferences in ways users may not truly want. We then explore potential solutions. First, we offer a unifying perspective on how an agent's optimization horizon may partially help reduce undesirable AI influence. Then, we formalize different notions of AI alignment that account for preference change from the outset. Comparing the strengths and limitations of 8 such notions of alignment, we find that they all either err towards causing undesirable AI influence, or are overly risk-averse, suggesting that a straightforward solution to the problems of changing preferences may not exist. As there is no avoiding grappling with changing preferences in real-world settings, this makes it all the more important to handle these issues with care, balancing risks and capabilities. We hope our work can provide conceptual clarity and constitute a first step towards AI alignment practices which explicitly account for (and contend with) the changing and influenceable nature of human preferences.
Optimizing Mario Adventures in a Constrained Environment
This project proposes and compares two approaches to playing the Super Mario Bros. (SMB) environment: a Genetic Algorithm (MarioGA) and NeuroEvolution (MarioNE). We not only learn to play SMB with these techniques but also optimise play under constraints on coin collection and level completion. First, for both algorithms we formalise the SMB agent to maximise the total value of collected coins (reward) and the total distance travelled (reward), so as to finish the level faster (time penalty). Second, we study MarioGA and its evaluation function (fitness criterion), including its representation methods, crossover and mutation operators, selection method, main loop, and other parameters. Third, MarioNE is applied to SMB: a population of ANNs with random weights is generated, and these networks control Mario's actions in the game. Fourth, SMB is further constrained so that the agent must complete the task within a specified time, keep rebirths (deaths) within a limit, and act within a maximum number of allowed moves, while still maximising the total coin value collected; this yields an efficient way of finishing SMB levels. Finally, we provide a fivefold comparative analysis covering fitness plots, the ability to finish the different levels of world 1, and domain adaptation (transfer learning) of the trained models.
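As a hedged illustration of the kind of fitness function the abstract describes (reward coins and distance, penalise time), the sketch below uses hypothetical weights and field names that are not taken from the paper:

```python
# Illustrative fitness function for a Mario-playing GA individual:
# reward collected coins and distance travelled, penalise elapsed time.
# The weights w_coin, w_dist, w_time are made-up defaults, not the paper's.
def fitness(coins, distance, time_taken, w_coin=10.0, w_dist=1.0, w_time=0.5):
    return w_coin * coins + w_dist * distance - w_time * time_taken

print(fitness(coins=12, distance=800, time_taken=90))  # 10*12 + 800 - 45 = 875.0
```

A GA would rank candidate controllers by this score each generation before applying selection, crossover, and mutation.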
Measuring and avoiding side effects using relative reachability
Krakovna, Victoria, Orseau, Laurent, Martic, Miljan, Legg, Shane
How can we design reinforcement learning agents that avoid causing unnecessary disruptions to their environment? We argue that current approaches to penalizing side effects can introduce bad incentives in tasks that require irreversible actions, and in environments that contain sources of change other than the agent. For example, some approaches give the agent an incentive to prevent any irreversible changes in the environment, including the actions of other agents. We introduce a general definition of side effects, based on relative reachability of states compared to a default state, that avoids these undesirable incentives. Using a set of gridworld experiments illustrating relevant scenarios, we empirically compare relative reachability to penalties based on existing definitions and show that it is the only penalty among those tested that produces the desired behavior in all the scenarios.
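A minimal sketch of the idea of a relative reachability penalty follows. It simplifies the paper's formulation (which uses discounted reachability) to binary reachability on a small deterministic graph; the function names and toy environment are illustrative, not the authors' code:

```python
# Sketch: penalise a state by the fraction of states that are reachable
# from a baseline state (e.g. after inaction) but no longer reachable
# from the current state. Binary reachability via BFS; a simplification
# of the paper's discounted-reachability measure.
from collections import deque

def reachable(graph, start):
    """Set of states reachable from `start` in a deterministic graph."""
    seen, frontier = {start}, deque([start])
    while frontier:
        s = frontier.popleft()
        for t in graph.get(s, []):
            if t not in seen:
                seen.add(t)
                frontier.append(t)
    return seen

def relative_reachability_penalty(graph, current, baseline):
    """Fraction of baseline-reachable states lost from `current`."""
    r_cur = reachable(graph, current)
    r_base = reachable(graph, baseline)
    return len(r_base - r_cur) / len(r_base) if r_base else 0.0

# Toy environment: breaking a vase ('broken') is irreversible.
graph = {
    'intact': ['intact', 'broken'],
    'broken': ['broken'],
}
print(relative_reachability_penalty(graph, 'broken', 'intact'))  # -> 0.5
```

Because the penalty is relative to the baseline rather than to the initial state, the agent is not rewarded for freezing the environment or undoing changes caused by other agents.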
Execution Monitoring as Meta-Games for General Game-Playing Robots
Rajaratnam, David (The University of New South Wales) | Thielscher, Michael (The University of New South Wales)
General Game Playing aims to create AI systems that can understand the rules of new games and learn to play them effectively without human intervention. The recent proposal for general game-playing robots extends this to AI systems that play games in the real world. Execution monitoring becomes a necessity when moving from a virtual to a physical environment, because in reality actions may not be executed properly and (human) opponents may make illegal game moves. We develop a formal framework for execution monitoring by which an action theory that provides an axiomatic description of a game is automatically embedded in a meta-game for a robotic player — called the arbiter — whose role is to monitor and correct failed actions. This allows for the seamless encoding of recovery behaviours within a meta-game, enabling a robot to recover from these unexpected events.
Solving the Inferential Frame Problem in the General Game Description Language
Davila, Javier Romero (University of Potsdam) | Saffidine, Abdallah (University of New South Wales) | Thielscher, Michael (University of New South Wales)
The Game Description Language GDL is the standard input language for general game-playing systems. While players can gain a lot of traction by an efficient inference algorithm for GDL, state-of-the-art reasoners suffer from a variant of a classical KR problem, the inferential frame problem. We present a method by which general game players can transform any given game description into a representation that solves this problem. Our experimental results demonstrate that with the help of automatically generated domain knowledge, a significant speedup can thus be obtained for the majority of the game descriptions from the AAAI competition.
The Epistemic Logic Behind the Game Description Language
Ruan, Ji (The University of New South Wales) | Thielscher, Michael (The University of New South Wales)
A general game player automatically learns to play arbitrary new games solely by being told their rules. For this purpose games are specified in the game description language GDL, a variant of Datalog with function symbols and a few known keywords. In its latest version, GDL allows the description of nondeterministic games with any number of players who may have imperfect, asymmetric information. We analyse the epistemic structure and expressiveness of this language in terms of epistemic modal logic and present two main results: (1) the operational semantics of GDL entails that the situation at any stage of a game can be characterised by a multi-agent epistemic (i.e., S5-) model; (2) GDL is sufficiently expressive to model any situation that can be described by a (finite) multi-agent epistemic model.
Efficient Implementation of the Plan Graph in STAN
The implementation is based on two insights: that many of the graph construction operations can be implemented as bit-level logical operations on bit vectors, and that the graph should not be explicitly constructed beyond the fixed point. First, we observe that action pre- and post-conditions can be represented using bit vectors. Checking for mutual exclusion between pairs of actions that directly interact can then be implemented using logical operations on these bit vectors, and mutual exclusion (mutex relations) between facts can be handled in a similar way. Second, we observe that there is no advantage in explicitly constructing the graph beyond the stage at which the fixed point is reached: since no new facts, actions, or mutex relations are added beyond the fixed point, goal sets can be considered without explicit copying of the fact and action layers. The layers correspond to snapshots of possible states at instants on a time line from the initial to the goal state. In this paper we describe the spike and wave-front mechanisms and provide experimental results indicating the performance advantages obtained. A more detailed discussion of the competition, from the competitors' point of view, is in preparation.
Cyclic Equilibria in Markov Games
Zinkevich, Martin, Greenwald, Amy, Littman, Michael L.
Although variants of value iteration have been proposed for finding Nash or correlated equilibria in general-sum Markov games, these variants have not been shown to be effective in general. In this paper, we demonstrate by construction that existing variants of value iteration cannot find stationary equilibrium policies in arbitrary general-sum Markov games. Instead, we propose an alternative interpretation of the output of value iteration based on a new (non-stationary) equilibrium concept that we call "cyclic equilibria." We prove that value iteration identifies cyclic equilibria in a class of games in which it fails to find stationary equilibria. We also demonstrate empirically that value iteration finds cyclic equilibria in nearly all examples drawn from a random distribution of Markov games.
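The paper's core observation can be caricatured as follows: when an iterative backup operator fails to converge to a fixed point, its iterates may settle into a repeating cycle, which can be read off as a non-stationary policy of some period k. The sketch below only detects such a cycle for an arbitrary deterministic update; the toy operator is a stand-in, not a game-theoretic value-iteration backup:

```python
# Detect when iterating a deterministic update operator enters a cycle.
# A period-k cycle in the iterates corresponds, in the paper's setting,
# to a non-stationary "cyclic equilibrium" rather than a stationary one.
def find_cycle(update, v0, max_iters=1000):
    """Iterate `update` from v0; return (cycle_start, period), or None."""
    seen = {}
    v = v0
    for i in range(max_iters):
        if v in seen:
            return seen[v], i - seen[v]
        seen[v] = i
        v = update(v)
    return None

# Toy operator that oscillates between two "value vectors" instead of
# converging: the iterates form a cycle of period 2 from the start.
update = lambda v: {(1, 0): (0, 1), (0, 1): (1, 0)}[v]
print(find_cycle(update, (1, 0)))  # -> (0, 2)
```

In the paper's construction the role of `update` is played by value iteration's equilibrium backup on a general-sum Markov game, and the cycling iterates themselves define the cyclic equilibrium policy.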