Reinforcement Learning
A Deep Hierarchical Approach to Lifelong Learning in Minecraft
Tessler, Chen (Technion) | Givony, Shahar (Technion) | Zahavy, Tom (Technion) | Mankowitz, Daniel J. (Technion) | Mannor, Shie (Technion)
We propose a lifelong learning system that has the ability to reuse and transfer knowledge from one task to another while efficiently retaining the previously learned knowledge base. Knowledge is transferred by learning reusable skills to solve tasks in Minecraft, a popular video game which is an unsolved and high-dimensional lifelong learning problem. These reusable skills, which we refer to as Deep Skill Networks, are then incorporated into our novel Hierarchical Deep Reinforcement Learning Network (H-DRLN) architecture using two techniques: (1) a deep skill array and (2) skill distillation, our novel variation of policy distillation (Rusu et al. 2015) for learning skills. Skill distillation enables the H-DRLN to efficiently retain knowledge and therefore scale in lifelong learning, by accumulating knowledge and encapsulating multiple reusable skills into a single distilled network. The H-DRLN exhibits superior performance and lower learning sample complexity compared to the regular Deep Q Network (Mnih et al. 2015) in sub-domains of Minecraft.
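To make the distillation idea concrete, here is a minimal sketch of a policy-distillation style loss: the KL divergence between a temperature-sharpened softmax over a teacher's Q-values and a softmax over a student's outputs. This illustrates the general policy distillation objective the abstract builds on, not the authors' skill distillation itself; the function names, the temperature value, and the toy Q-values are assumptions made for illustration.

```python
import math


def softmax(values, temperature=1.0):
    """Numerically stable softmax of values / temperature."""
    scaled = [v / temperature for v in values]
    top = max(scaled)
    exps = [math.exp(s - top) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(teacher_q, student_q, temperature=0.1):
    """KL(teacher || student) for one state (hypothetical helper).

    teacher_q: Q-values from an already trained skill network.
    student_q: raw outputs of the network being distilled into.
    A temperature below 1 sharpens the teacher's distribution.
    """
    p = softmax(teacher_q, temperature)
    q = softmax(student_q)
    # Terms with p_i == 0 contribute nothing to the KL sum.
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)


if __name__ == "__main__":
    teacher = [1.0, 2.0, 3.0]  # toy Q-values, assumed for illustration
    print(distillation_loss(teacher, [3.0, 2.0, 1.0]))  # student disagrees
    print(distillation_loss(teacher, [0.0, 0.0, 9.0]))  # student agrees more
```

Minimizing this loss over states visited by several teacher skills is one way a single student network could come to encapsulate them; how the H-DRLN actually does this is described in the paper, not here.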
Fast Inverse Reinforcement Learning with Interval Consistent Graph for Driving Behavior Prediction
Shimosaka, Masamichi (Tokyo Institute of Technology) | Sato, Junichi (The University of Tokyo) | Takenaka, Kazuhito (Denso Corporation) | Hitomi, Kentarou (Denso Corporation)
Inverse reinforcement learning (IRL), inverse optimal control, and imitation learning (Ng and Russell 2000; Abbeel and Ng 2004) are modeling frameworks for acquiring rewards (or cost) of a certain environment by using the optimal path under a possibly different environment as training data. In particular, in human behavior modeling, it is shown that human-centered rewards can be obtained with maximum entropy inverse reinforcement learning (MaxEnt IRL) (Ziebart and others 2008), which allows suboptimal training data (Huang et al. 2015; Vernaza and Bagnell 2012; Dragan and Srinivasa 2012; Walker, Gupta, and Hebert 2014). For instance, Ziebart et al. (Ziebart et al. 2008) modeled the driving behavior of expert taxi drivers and enabled driving behavior prediction based on the experts' very own experience or knowledge. MaxEnt IRL based driving behavior prediction, which balances safety, comfort, and economic performance, is very promising.
In contrast, a discrete approach guarantees global optimality once proper discrete state space is given, hence it is more suitable for driving behavior modeling. In a discrete approach, the calculation cost of MaxEnt IRL is O(SA), where S is the number of states and A is the number of actions (Ziebart and others 2008). That is, the key for fast prediction is suppressing the increase of S depending on dimensions and preparing a necessary and sufficient action set, A, for representing driving behavior. As examples of existing discretization schemes, there are mesh grid representation (Shimosaka, Kaneko, and Nishi 2014) and random graph based representation connected with neighbors (Byravan et al. 2015). In these approaches, however, A for general dynamic systems is not trivial. This is because neighbors on state space defined by Euclidean distance do not necessarily correspond to the transition area of general dynamics ...
Learning to Act by Predicting the Future
Dosovitskiy, Alexey | Koltun, Vladlen
We present an approach to sensorimotor control in immersive environments. Our approach utilizes a high-dimensional sensory stream and a lower-dimensional measurement stream. The cotemporal structure of these streams provides a rich supervisory signal, which enables training a sensorimotor control model by interacting with the environment. The model is trained using supervised learning techniques, but without extraneous supervision. It learns to act based on raw sensory input from a complex three-dimensional environment. The presented formulation enables learning without a fixed goal at training time, and pursuing dynamically changing goals at test time. We conduct extensive experiments in three-dimensional simulations based on the classical first-person game Doom. The results demonstrate that the presented approach outperforms sophisticated prior formulations, particularly on challenging tasks. The results also show that trained models successfully generalize across environments and goals. A model trained using the presented approach won the Full Deathmatch track of the Visual Doom AI Competition, which was held in previously unseen environments.
This Week's Awesome Stories From Around the Web (Through February 11th)
Understanding Agent Cooperation | Joel Leibo, Vinicius Zambaldi, Marc Lanctot, Janusz Marecki, Thore Graepel | Google DeepMind Blog
"Recent progress in artificial intelligence and specifically deep reinforcement learning provides us with the tools to look at the problem of social dilemmas through a new lens... we showed that we can apply the modern AI technique of deep multi-agent reinforcement learning to age-old questions in social science such as the mystery of the emergence of cooperation."
Agility Robotics Introduces Cassie, a Dynamic and Talented Robot Delivery Ostrich | Evan Ackerman | IEEE Spectrum
"Agility Robotics, a spin-off of Oregon State University, is officially announcing a shiny new bipedal robot named Cassie. Cassie is a dynamic walker, meaning that it walks much more like humans do than most of the carefully plodding bipedal robots we're used to seeing... Cassie has some work to do before it's ready to be hauling groceries up stairs for you, but we're very much looking forward to watching this robot taking more steps toward robust and dynamic legged locomotion."
How Escape Rooms and Live Theater Are Paving the Way for VR | Bryan Bishop | The Verge
"Cinema has had more than a century to develop its own language of shots, cuts, and transitions, while storytelling in VR is still in its infancy... creators seem to be zeroing in on interactive, experiential moments as one of the key building blocks of VR storytelling. One of Chris Milk's next projects is a piece set in the Planet of the Apes universe that will lean heavily on AI to drive interactive character performances."
Batch Policy Gradient Methods for Improving Neural Conversation Models
Kandasamy, Kirthevasan | Bachrach, Yoram | Tomioka, Ryota | Tarlow, Daniel | Carter, David
We study reinforcement learning of chatbots with recurrent neural network architectures when the rewards are noisy and expensive to obtain. For instance, a chatbot used in automated customer service support can be scored by quality assurance agents, but this process can be expensive, time consuming and noisy. Previous reinforcement learning work for natural language processing uses on-policy updates and/or is designed for on-line learning settings. We demonstrate empirically that such strategies are not appropriate for this setting and develop an off-policy batch policy gradient method (BPG). We demonstrate the efficacy of our method via a series of synthetic experiments and an Amazon Mechanical Turk experiment on a restaurant recommendations dataset.
Reinforcement Learning as a Service
I've been integrating reinforcement learning into an actual product for the last 6 months, and therefore I'm developing an appreciation for what are likely to be common problems. In particular, I'm now sold on the idea of reinforcement learning as a service, of which the decision service from MSR-NY is an early example (limited to contextual bandits at the moment, but incorporating key system insights).
Service, not algorithm
Supervised learning is essentially observational: some data has been collected and subsequently algorithms are run on it. In contrast, counterfactual learning is very difficult to do observationally. Diverse fields such as economics, political science, and epidemiology all attempt to make counterfactual conclusions using observational data, essentially because this is the only data available (at an affordable cost).
Adversarial Attacks on Neural Network Policies
Huang, Sandy | Papernot, Nicolas | Goodfellow, Ian | Duan, Yan | Abbeel, Pieter
Machine learning classifiers are known to be vulnerable to inputs maliciously constructed by adversaries to force misclassification. Such adversarial examples have been extensively studied in the context of computer vision applications. In this work, we show adversarial attacks are also effective when targeting neural network policies in reinforcement learning. Specifically, we show existing adversarial example crafting techniques can be used to significantly degrade test-time performance of trained policies. Our threat model considers adversaries capable of introducing small perturbations to the raw input of the policy. We characterize the degree of vulnerability across tasks and training algorithms, for a subclass of adversarial-example attacks in white-box and black-box settings. Regardless of the learned task or training algorithm, we observe a significant drop in performance, even with small adversarial perturbations that do not interfere with human perception. Videos are available at http://rll.berkeley.edu/adversarial.
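As a rough illustration of what a small input perturbation can do to a policy, the sketch below applies the fast gradient sign method (FGSM), one well-known adversarial example crafting technique, to a toy linear-softmax policy in the white-box setting. The abstract does not say which crafting techniques were used, and the policy, weights, observation, and epsilon here are all assumptions chosen for illustration.

```python
import math


def action_probs(W, x):
    """Softmax policy over actions; W holds one weight row per action."""
    logits = [sum(w * xi for w, xi in zip(row, x)) for row in W]
    top = max(logits)
    exps = [math.exp(v - top) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def greedy_action(W, x):
    """Index of the most probable action for observation x."""
    probs = action_probs(W, x)
    return max(range(len(probs)), key=probs.__getitem__)


def sign(g):
    return (g > 0) - (g < 0)


def fgsm_perturb(W, x, eps):
    """Return x + eps * sign(grad_x loss), a white-box FGSM step.

    The loss is the cross-entropy of the policy against its own
    preferred action, so raising it pushes the policy away from
    the action it would otherwise take.
    """
    probs = action_probs(W, x)
    chosen = greedy_action(W, x)
    # For a linear-softmax model: d(-log p[chosen])/dx_j
    #   = sum_k (p_k - 1[k == chosen]) * W[k][j]
    grad = [
        sum((probs[k] - (1.0 if k == chosen else 0.0)) * W[k][j]
            for k in range(len(W)))
        for j in range(len(x))
    ]
    return [xi + eps * sign(g) for xi, g in zip(x, grad)]


if __name__ == "__main__":
    W = [[1.0, 0.0], [0.0, 1.0]]   # toy weights (assumed)
    x = [0.55, 0.45]               # toy observation (assumed)
    x_adv = fgsm_perturb(W, x, eps=0.1)
    print(greedy_action(W, x), greedy_action(W, x_adv))
```

Every input coordinate moves by at most eps, which is the sense in which the perturbation is small; for deep policies the gradient would come from backpropagation instead of the closed form used here.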
Value Alignment or Misalignment -- What Will Keep Systems Accountable?
Arnold, Thomas (Tufts University) | Kasenberg, Daniel (Tufts University) | Scheutz, Matthias (Tufts University)
Machine learning's advances have led to new ideas about the feasibility and importance of machine ethics keeping pace, with increasing emphasis on safety, containment, and alignment. This paper addresses a recent suggestion that inverse reinforcement learning (IRL) could be a means to so-called "value alignment." We critically consider how such an approach can engage the social, norm-infused nature of ethical action and outline several features of ethical appraisal that go beyond simple models of behavior, including unavoidably temporal dimensions of norms and counterfactuals. We propose that a hybrid approach for computational architectures still offers the most promising avenue for machines acting in an ethical fashion.
Causal Learning versus Reinforcement Learning for Knowledge Learning and Problem Solving
Ho, Seng-Beng (Institute of High Performance Computing)
Causal learning and reinforcement learning are both important AI learning mechanisms but are usually treated separately, despite the fact that both are directly relevant to problem solving processes. In this paper we propose a method for causal learning and problem solving, and compare and contrast that with AI reinforcement learning and show that the two methods are actually related, differing only in the values of the learning rate α and discount factor γ. However, the causal learning framework emphasizes quick but non-optimal concoction of problem solutions while AI reinforcement learning generates optimal solutions at the expense of speed. Cognitive science literature is reviewed and it is found that psychological reinforcement learning in lower form animals such as mammals is distinct from AI reinforcement learning in that psychological reinforcement learning strives neither for speed nor optimality, and that higher form animals such as humans and primates employ quick causal learning for survival instead of reinforcement learning. AI systems should likewise take advantage of a framework that employs rapid inductive causal learning to generate problem solutions for its general viability in terms of rapid adaptability, without the need to always strive for optimality.
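For readers unfamiliar with the two parameters named above, the following sketch shows where the learning rate α and discount factor γ enter a standard tabular Q-learning update. It is a generic textbook update, not the paper's method; the abstract does not state which values of α and γ correspond to causal learning, so the values below are arbitrary and the state and action names are hypothetical.

```python
def q_update(q, state, action, reward, next_state, alpha, gamma):
    """Apply one tabular Q-learning update in place and return the new value.

    q maps state -> {action: estimated value}.
    alpha (learning rate) sets how far the estimate moves toward the target.
    gamma (discount factor) sets how much the best next-state value counts.
    """
    next_values = q.get(next_state, {})
    best_next = max(next_values.values()) if next_values else 0.0
    td_target = reward + gamma * best_next
    q[state][action] += alpha * (td_target - q[state][action])
    return q[state][action]


if __name__ == "__main__":
    # Hypothetical two-state table for illustration.
    slow = {"s": {"a": 0.0}, "t": {"a": 1.0}}
    fast = {"s": {"a": 0.0}, "t": {"a": 1.0}}
    print(q_update(slow, "s", "a", 1.0, "t", alpha=0.5, gamma=0.9))
    print(q_update(fast, "s", "a", 1.0, "t", alpha=1.0, gamma=1.0))
```

With alpha set to 1 the estimate jumps to the target after a single experience, while smaller values average over many experiences, which gives a sense of how changing only these two parameters can trade speed of learning against the quality of the converged estimate.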
Clyde: A Deep Reinforcement Learning DOOM Playing Agent
Ratcliffe, Dino Stephen (University of Essex) | Devlin, Sam (University of York) | Kruschwitz, Udo (University of Essex) | Citi, Luca (University of Essex)
In many cases games provide noise free environments and can also encompass the whole world state in data structures easily. Much of the early work in this domain has focussed on digital implementations of board games, such as backgammon (Tesauro 1995), chess (Campbell, Hoane, and Hsu 2002) and more recently go (Silver et al. 2016). These games have then been used to benchmark many different approaches, including tree search approaches such as Monte Carlo Tree Search (MCTS) (Browne et al. 2012) along with other approaches such as deep reinforcement ...
... computer science at Poznan University. It provides an interface for AI agents to learn from the raw visual data that is produced by DOOM (Kempka et al. 2016). They also run a competition that places these agents into death matches in order to compare their performance. A death match in the case of this competition is a time limited game mode where each agent must accumulate the highest score possible by killing other agents in the match. This is where our agent was submitted in order to assess its performance against other agents.