Reports on the 2018 AAAI Spring Symposium Series
Amato, Christopher (Northeastern University) | Ammar, Haitham Bou (PROWLER.io) | Churchill, Elizabeth (Google) | Karpas, Erez (Technion - Israel Institute of Technology) | Kido, Takashi (Stanford University) | Kuniavsky, Mike (Parc) | Lawless, W. F. (Paine College) | Rossi, Francesca (IBM T. J. Watson Research Center and University of Padova) | Oliehoek, Frans A. (TU Delft) | Russell, Stephen (US Army Research Laboratory) | Takadama, Keiki (University of Electro-Communications) | Srivastava, Siddharth (Arizona State University) | Tuyls, Karl (Google DeepMind) | Allen, Philip Van (Art Center College of Design) | Venable, K. Brent (Tulane University and IHMC) | Vrancx, Peter (PROWLER.io) | Zhang, Shiqi (Cleveland State University)
The Association for the Advancement of Artificial Intelligence, in cooperation with Stanford University's Department of Computer Science, presented the 2018 Spring Symposium Series, held Monday through Wednesday, March 26–28, 2018, on the campus of Stanford University. The seven symposia held were AI and Society: Ethics, Safety and Trustworthiness in Intelligent Agents; Artificial Intelligence for the Internet of Everything; Beyond Machine Intelligence: Understanding Cognitive Bias and Humanity for Well-Being AI; Data Efficient Reinforcement Learning; The Design of the User Experience for Artificial Intelligence (the UX of AI); Integrated Representation, Reasoning, and Learning in Robotics; and Learning, Inference, and Control of Multi-Agent Systems. This report, compiled from the organizers of the symposia, summarizes the research presented at five of the seven symposia.
Deep Q-learning From Demonstrations
Hester, Todd (Google DeepMind) | Vecerik, Matej (Google DeepMind) | Pietquin, Olivier (Google DeepMind) | Lanctot, Marc (Google DeepMind) | Schaul, Tom (Google DeepMind) | Piot, Bilal (Google DeepMind) | Horgan, Dan (Google DeepMind) | Quan, John (Google DeepMind) | Sendonaris, Andrew (Google DeepMind) | Osband, Ian (Google DeepMind) | Dulac-Arnold, Gabriel (Google DeepMind) | Agapiou, John (Google DeepMind) | Leibo, Joel Z. (Google DeepMind) | Gruslys, Audrunas (Google DeepMind)
Deep reinforcement learning (RL) has achieved several high-profile successes in difficult decision-making problems. However, these algorithms typically require a huge amount of data before they reach reasonable performance; in fact, their performance during learning can be extremely poor. This may be acceptable in a simulator, but it severely limits the applicability of deep RL to many real-world tasks, where the agent must learn in the real environment. In this paper we study a setting where the agent may access data from previous control of the system. We present an algorithm, Deep Q-learning from Demonstrations (DQfD), that leverages even relatively small amounts of demonstration data to massively accelerate learning, and that automatically assesses the necessary ratio of demonstration data while learning, thanks to a prioritized replay mechanism. DQfD works by combining temporal difference updates with supervised classification of the demonstrator's actions. We show that DQfD has better initial performance than Prioritized Dueling Double Deep Q-Networks (PDD DQN): it starts with better scores over the first million steps on 41 of 42 games, and on average it takes PDD DQN 83 million steps to catch up to DQfD's performance. DQfD learns to outperform the best demonstration given in 14 of 42 games. In addition, DQfD leverages human demonstrations to achieve state-of-the-art results on 11 games. Finally, we show that DQfD performs better than three related algorithms for incorporating demonstration data into DQN.
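The combination described above can be made concrete as a single per-transition loss that adds a large-margin classification term, applied only to demonstration transitions, on top of a temporal-difference term (written here in double-Q form). The NumPy sketch below is an illustration under assumed names and coefficients, not the paper's implementation, which includes additional terms such as an n-step return and regularization.

```python
import numpy as np

def dqfd_style_loss(q, q_target, s, a, r, s_next, gamma=0.99,
                    is_demo=False, margin=0.8, lambda_e=1.0):
    """Illustrative per-transition loss in the spirit of DQfD: a 1-step
    double-Q TD term plus, on demonstration data only, a large-margin
    classification term that pushes the demonstrated action's value above
    all other actions by at least `margin`.  `q` and `q_target` map a
    state to a vector of action values (online and target networks)."""
    # Double-Q style 1-step TD error: online net selects, target net evaluates.
    a_star = int(np.argmax(q(s_next)))
    td_target = r + gamma * q_target(s_next)[a_star]
    td_loss = (td_target - q(s)[a]) ** 2

    # Large-margin supervised loss on the demonstrator's action.
    margin_loss = 0.0
    if is_demo:
        margins = np.full(len(q(s)), margin)
        margins[a] = 0.0                      # no margin for the expert action
        margin_loss = float(np.max(q(s) + margins) - q(s)[a])

    return td_loss + lambda_e * margin_loss
```

In use, `q` and `q_target` would wrap the online and target networks, and the prioritized replay buffer would mix demonstration and self-generated transitions, setting `is_demo` accordingly.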
Increasing the Action Gap: New Operators for Reinforcement Learning
Bellemare, Marc G. (Google DeepMind) | Ostrovski, Georg (Google DeepMind) | Guez, Arthur (Google DeepMind) | Thomas, Philip S. (Google DeepMind) | Munos, Remi (Google DeepMind)
This paper introduces new optimality-preserving operators on Q-functions. We first describe an operator for tabular representations, the consistent Bellman operator, which incorporates a notion of local policy consistency. We show that this local consistency leads to an increase in the action gap at each state; increasing this gap, we argue, mitigates the undesirable effects of approximation and estimation errors on the induced greedy policies. This operator can also be applied to discretized continuous space and time problems, and we provide empirical results evidencing superior performance in this context. Extending the idea of a locally consistent operator, we then derive sufficient conditions for an operator to preserve optimality, leading to a family of operators which includes our consistent Bellman operator. As corollaries we provide a proof of optimality for Baird's advantage learning algorithm and derive other gap-increasing operators with interesting properties. We conclude with an empirical study on 60 Atari 2600 games illustrating the strong potential of these new operators.
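For concreteness, the two operators named in the abstract can be written as tabular backups. The NumPy sketch below assumes a finite MDP given by a transition tensor P of shape [S, A, S] and a reward table R of shape [S, A]; these shapes and the gap-increasing coefficient alpha are illustrative assumptions.

```python
import numpy as np

def advantage_learning_backup(Q, P, R, gamma, alpha=0.1):
    """Gap-increasing backup in the spirit of advantage learning:
    (T_AL Q)(s, a) = (T Q)(s, a) - alpha * (V(s) - Q(s, a)),
    where T is the Bellman optimality operator and V(s) = max_b Q(s, b)."""
    V = Q.max(axis=1)                                  # V(s)
    TQ = R + gamma * np.einsum('sat,t->sa', P, V)      # standard Bellman backup
    return TQ - alpha * (V[:, None] - Q)               # subtract a scaled action gap

def consistent_bellman_backup(Q, P, R, gamma):
    """Consistent Bellman backup: on a self-transition (next state equal to
    the current state) bootstrap with Q(s, a) itself rather than
    max_b Q(s, b), which locally increases the action gap."""
    S, A = Q.shape
    V = Q.max(axis=1)
    boot = np.broadcast_to(V, (S, A, S)).copy()        # default bootstrap: max_b Q(s', b)
    idx = np.arange(S)
    boot[idx, :, idx] = Q                              # self-transition: bootstrap with Q(s, a)
    return R + gamma * np.einsum('sat,sat->sa', P, boot)
```

Iterating either backup to a fixed point and acting greedily illustrates the gap-increasing behavior; the paper's Atari experiments apply the same idea with function approximation rather than a tabular Q.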
Generalized Emphatic Temporal Difference Learning: Bias-Variance Analysis
Hallak, Assaf (Technion - Israel Institute of Technology) | Tamar, Aviv (University of California, Berkeley) | Munos, Remi (Google DeepMind) | Mannor, Shie (Technion - Israel Institute of Technology)
We consider the off-policy evaluation problem in Markov decision processes with function approximation. We propose a generalization of the recently introduced emphatic temporal differences (ETD) algorithm, which encompasses the original ETD(λ), as well as several other off-policy evaluation algorithms, as special cases. We call this framework ETD(λ, β), where our introduced parameter β controls the decay rate of an importance-sampling term. We study conditions under which the projected fixed-point equation underlying ETD(λ, β) involves a contraction operator, allowing us to present the first asymptotic error bounds (bias) for ETD(λ, β). Our results show that the original ETD algorithm always involves a contraction operator, and its bias is bounded. Moreover, by controlling β, our proposed generalization allows trading off bias for variance reduction, thereby achieving a lower total error.
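Under the simplifying assumptions of constant interest and a fixed discount, the kind of update studied here can be sketched for linear function approximation as follows; the recursion in which β scales the decay of the followon (importance-sampling) trace is an illustrative reading of the framework, with β = γ intended to recover the original ETD(λ).

```python
import numpy as np

def etd_lambda_beta(features, rewards, rhos, theta0,
                    alpha=0.05, gamma=0.99, lam=0.9, beta=0.9):
    """Illustrative linear ETD(lambda, beta)-style evaluation update.
    features[t] is the feature vector of S_t (length len(rewards) + 1),
    rhos[t] is the importance-sampling ratio pi(A_t|S_t) / mu(A_t|S_t),
    and beta controls how fast the followon trace decays."""
    theta = np.asarray(theta0, dtype=float).copy()
    F, e = 1.0, np.zeros_like(theta)
    for t in range(len(rewards)):
        phi, phi_next = features[t], features[t + 1]
        M = lam + (1.0 - lam) * F                    # emphasis weighting
        e = rhos[t] * (gamma * lam * e + M * phi)    # emphatic eligibility trace
        delta = rewards[t] + gamma * theta @ phi_next - theta @ phi
        theta = theta + alpha * delta * e            # TD update weighted by the trace
        F = beta * rhos[t] * F + 1.0                 # followon trace, decayed by beta
    return theta
```

Smaller values of β shorten the effective horizon of the importance-sampling products, which is the bias-for-variance trade-off the abstract describes.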
Deep Reinforcement Learning with Double Q-Learning
Hasselt, Hado van (Google DeepMind) | Guez, Arthur (Google DeepMind) | Silver, David (Google DeepMind)
The popular Q-learning algorithm is known to overestimate action values under certain conditions. It was not previously known whether, in practice, such overestimations are common, whether they harm performance, and whether they can generally be prevented. In this paper, we answer all these questions affirmatively. In particular, we first show that the recent DQN algorithm, which combines Q-learning with a deep neural network, suffers from substantial overestimations in some games in the Atari 2600 domain. We then show that the idea behind the Double Q-learning algorithm, which was introduced in a tabular setting, can be generalized to work with large-scale function approximation. We propose a specific adaptation to the DQN algorithm and show that the resulting algorithm not only reduces the observed overestimations, as hypothesized, but that this also leads to much better performance on several games.
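The adaptation amounts to changing how the bootstrap target is formed: the online network selects the greedy action and the target network evaluates it, rather than a single network doing both. The sketch below contrasts the two targets; the function names and the presence of a separate target network follow the standard DQN setup and are written here only as an illustration.

```python
import numpy as np

def dqn_target(q_target_next, reward, gamma=0.99):
    """Standard DQN target: the same (target) network both selects and
    evaluates the next action, which tends to overestimate values."""
    return reward + gamma * float(np.max(q_target_next))

def double_dqn_target(q_online_next, q_target_next, reward, gamma=0.99):
    """Double DQN target: the online network selects the action and the
    target network evaluates it, decoupling selection from evaluation."""
    a_star = int(np.argmax(q_online_next))
    return reward + gamma * float(q_target_next[a_star])
```

Here `q_online_next` and `q_target_next` are the vectors of action values at the next state under the online and target networks, respectively.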
Compress and Control
Veness, Joel (Google DeepMind) | Bellemare, Marc G. (Google DeepMind) | Hutter, Marcus (Australian National University) | Chua, Alvin (Google DeepMind) | Desjardins, Guillaume (Google DeepMind)
This paper describes a new information-theoretic policy evaluation technique for reinforcement learning. This technique converts any compression or density model into a corresponding estimate of value. Under appropriate stationarity and ergodicity conditions, we show that the use of a sufficiently powerful model gives rise to a consistent value function estimator. We also study the behavior of this technique when applied to various Atari 2600 video games, where the use of suboptimal modeling techniques is unavoidable. We consider three fundamentally different models, all too limited to perfectly model the dynamics of the system. Remarkably, we find that our technique provides sufficiently accurate value estimates for effective on-policy control. We conclude with a suggestive study highlighting the potential of our technique to scale to large problems.
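One plausible reading of the construction, assuming discretized returns: learn a density model of states conditioned on each return level, invert it with Bayes' rule to obtain a posterior over returns at the current state, and use the posterior mean as the value estimate. In the sketch below, log_p_state_given_return and log_p_return are hypothetical stand-ins for the compression-based density models.

```python
import numpy as np

def value_from_density_models(state, return_bins,
                              log_p_state_given_return, log_p_return):
    """Bayes-rule value estimate from density models: combine p(s | g) and
    p(g) over a grid of discretized returns g, normalize to get p(g | s),
    and return the posterior mean E[G | s]."""
    log_joint = np.array([log_p_state_given_return(state, g) + log_p_return(g)
                          for g in return_bins])
    log_joint -= log_joint.max()                   # numerical stability
    posterior = np.exp(log_joint)
    posterior /= posterior.sum()                   # p(g | s)
    return float(np.dot(return_bins, posterior))   # posterior-mean value estimate
```

With one such estimate per action (conditioning on state-action pairs rather than states alone), acting greedily with respect to the estimates would give a route to the on-policy control behavior the abstract reports.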