Reinforcement Learning
Pre-training Neural Networks with Human Demonstrations for Deep Reinforcement Learning
Cruz, Gabriel V. de la Jr, Du, Yunshu, Taylor, Matthew E.
Deep reinforcement learning (deep RL) has achieved superior performance in complex sequential tasks by using a deep neural network as its function approximator and by learning directly from raw images. A drawback of using raw images is that deep RL must learn the state feature representation from the raw images in addition to learning a policy. As a result, deep RL can require a prohibitively large amount of training time and data to reach reasonable performance, making it difficult to use deep RL in real-world applications, especially when data is expensive. In this work, we speed up training by addressing half of what deep RL is trying to solve --- learning features. Our approach is to learn some of the important features by pre-training deep RL network's hidden layers via supervised learning using a small set of human demonstrations. We empirically evaluate our approach using deep Q-network (DQN) and asynchronous advantage actor-critic (A3C) algorithms on the Atari 2600 games of Pong, Freeway, and Beamrider. Our results show that: 1) pre-training with human demonstrations in a supervised learning manner is better at discovering features relative to pre-training naively in DQN, and 2) initializing a deep RL network with a pre-trained model provides a significant improvement in training time even when pre-training from a small number of human demonstrations.
Multi-Preference Actor Critic
Durugkar, Ishan, Hausknecht, Matthew, Swaminathan, Adith, MacAlpine, Patrick
Policy gradient algorithms typically combine discounted future rewards with an estimated value function, to compute the direction and magnitude of parameter updates. However, for most Reinforcement Learning tasks, humans can provide additional insight to constrain the policy learning. We introduce a general method to incorporate multiple different feedback channels into a single policy gradient loss. In our formulation, the Multi-Preference Actor Critic (M-PAC), these different types of feedback are implemented as constraints on the policy. We use a Lagrangian relaxation to satisfy these constraints using gradient descent while learning a policy that maximizes rewards. Experiments in Atari and Pendulum verify that constraints are being respected and can accelerate the learning process.
Synthesized Policies for Transfer and Adaptation across Tasks and Environments
Hu, Hexiang, Chen, Liyu, Gong, Boqing, Sha, Fei
The ability to transfer in reinforcement learning is key towards building an agent of general artificial intelligence. In this paper, we consider the problem of learning to simultaneously transfer across both environments (ENV) and tasks (TASK), probably more importantly, by learning from only sparse (ENV, TASK) pairs out of all the possible combinations. We propose a novel compositional neural network architecture which depicts a meta rule for composing policies from the environment and task embeddings. Notably, one of the main challenges is to learn the embeddings jointly with the meta rule. We further propose new training methods to disentangle the embeddings, making them both distinctive signatures of the environments and tasks and effective building blocks for composing the policies. Experiments on GridWorld and Thor, of which the agent takes as input an egocentric view, show that our approach gives rise to high success rates on all the (ENV, TASK) pairs after learning from only 40\% of them.
Reducing catastrophic forgetting when evolving neural networks
A key stepping stone in the development of an artificial general intelligence (a machine that can perform any task), is the production of agents that can perform multiple tasks at once instead of just one. Unfortunately, canonical methods are very prone to catastrophic forgetting (CF) - the act of overwriting previous knowledge about a task when learning a new task. Recent efforts have developed techniques for overcoming CF in learning systems, but no attempt has been made to apply these new techniques to evolutionary systems. This research presents a novel technique, weight protection, for reducing CF in evolutionary systems by adapting a method from learning systems. It is used in conjunction with other evolutionary approaches for overcoming CF and is shown to be effective at alleviating CF when applied to a suite of reinforcement learning tasks. It is speculated that this work could indicate the potential for a wider application of existing learning-based approaches to evolutionary systems and that evolutionary techniques may be competitive with or better than learning systems when it comes to reducing CF.
Structured agents for physical construction
Bapst, Victor, Sanchez-Gonzalez, Alvaro, Doersch, Carl, Stachenfeld, Kimberly L., Kohli, Pushmeet, Battaglia, Peter W., Hamrick, Jessica B.
Physical construction -- the ability to compose objects, subject to physical dynamics, in order to serve some function -- is fundamental to human intelligence. Here we introduce a suite of challenging physical construction tasks inspired by how children play with blocks, such as matching a target configuration, stacking and attaching blocks to connect objects together, and creating shelter-like structures over target objects. We then examine how a range of modern deep reinforcement learning agents fare on these challenges, and introduce several new approaches which provide superior performance. Our results show that agents which use structured representations (e.g., objects and scene graphs) and structured policies (e.g., object-centric actions) outperform those which use less structured representations, and generalize better beyond their training when asked to reason about larger scenes. Agents which use model-based planning via Monte-Carlo Tree Search also outperform strictly model-free agents in our most challenging construction problems. We conclude that approaches which combine structured representations and reasoning with powerful learning are a key path toward agents that possess rich intuitive physics, scene understanding, and planning.
Self-Adapting Goals Allow Transfer of Predictive Models to New Tasks
Ellefsen, Kai Olav, Torresen, Jim
A long-standing challenge in Reinforcement Learning is enabling agents to learn a model of their environment which can be transferred to solve other problems in a world with the same underlying rules. One reason this is difficult is the challenge of learning accurate models of an environment. If such a model is inaccurate, the agent's plans and actions will likely be sub-optimal, and likely lead to the wrong outcomes. Recent progress in model-based reinforcement learning has improved the ability for agents to learn and use predictive models. In this paper, we extend a recent deep learning architecture which learns a predictive model of the environment that aims to predict only the value of a few key measurements, which are be indicative of an agent's performance. Predicting only a few measurements rather than the entire future state of an environment makes it more feasible to learn a valuable predictive model. We extend this predictive model with a small, evolving neural network that suggests the best goals to pursue in the current state. We demonstrate that this allows the predictive model to transfer to new scenarios where goals are different, and that the adaptive goals can even adjust agent behavior on-line, changing its strategy to fit the current context.
Artificial Intelligence and Game Theory Models for Defending Critical Networks with Cyber Deception
Fugate, Sunny (Space and Naval Warfare Systems Center Pacific) | Ferguson-Walter, Kimberly (US Department of Defense)
Traditional cyber security techniques have led to an asymmetric disadvantage for defenders. The defender must detect all possible threats at all times from all attackers and defend all systems against all possible exploitation. In contrast, an attacker needs only to find a single path to the defenderโs critical information. In this article, we discuss how this asymmetry can be rebalanced using cyber deception to change the attackerโs perception of the network environment, and lead attackers to false beliefs about which systems contain critical information or are critical to a defenderโs computing infrastructure. We introduce game theory concepts and models to represent and reason over the use of cyber deception by the defender and the effect it has on attacker perception. Finally, we discuss techniques for combining artificial intelligence algorithms with game theory models to estimate hidden states of the attacker using feedback through payoffs to learn how best to defend the system using cyber deception. It is our opinion that adaptive cyber deception is a necessary component of future information systems and networks. The techniques we present can simultaneously decrease the risks and impacts suffered by defenders and dramatically increase the costs and risks of detection for attackers. Such techniques are likely to play a pivotal role in defending national and international security concerns.
Reinforcement Learning: Connections, Surprises, and Challenge
Barto, Andrew G. (University of Massachusetts Amherst)
The idea of implementing reinforcement learning in a computer was one of the earliest ideas about the possibility of AI, but reinforcement learning remained on the margin of AI until relatively recently. Today we see reinforcement learning playing essential roles in some of the most impressive AI applications. This article presents observations from the authorโs personal experience with reinforcement learning over the most recent 40 years of its history in AI, focusing on striking connections that emerged between largely separate disciplines and on some of the findings that surprised him along the way. These connections and surprises place reinforcement learning in a historical context, and they help explain the success it is finding in modern AI. The article concludes by discussing some of the challenges that need to be faced as reinforcement learning moves out into real world.
Can a Robot Become a Movie Director? Learning Artistic Principles for Aerial Cinematography
Gschwindt, Mirko, Camci, Efe, Bonatti, Rogerio, Wang, Wenshan, Kayacan, Erdal, Scherer, Sebastian
Aerial filming is becoming more and more popular thanks to the recent advances in drone technology. It invites many intriguing, unsolved problems at the intersection of aesthetical and scientific challenges. In this work, we propose an intelligent agent which supervises motion planning of a filming drone based on aesthetical values of video shots using deep reinforcement learning. Unlike the current state-of-the-art approaches which mostly require explicit guidance by a human expert, our drone learns how to make favorable shot type selections by experience. We propose a learning scheme which exploits aesthetical features of retrospective shots in order to extract a desirable policy for better prospective shots. We train our agent in realistic AirSim simulations using both hand-crafted and human reward functions. We deploy the same agent on a real DJI M210 drone in order to test generalization capability of our approach to real world conditions. To evaluate the success of our approach in the end, we conduct a comprehensive user study in which participants rate the shots taken using our method and write comments about them.
Reinforcement Learning Demystified: Markov Decision Processes (Part 1)
In the previous blog post we talked about reinforcement learning and its characteristics. We mentioned the process of the agent observing the environment output consisting of a reward and the next state, and then acting upon that. This whole process is a Markov Decision Process or an MDP for short. This blog post is a bit mathy. Grab your coffee and a comfortable chair, and just dive in.