Reinforcement Learning
Adversarial Reinforcement Learning Framework for Benchmarking Collision Avoidance Mechanisms in Autonomous Vehicles
Behzadan, Vahid, Munir, Arslan
It is widely believed that the transportation systems of the future will be dominated by autonomous vehicles (AVs). With the rapid advancements in this field in recent years, many have come to predict that this shift will occur within the next ten years. A major motivation for the interest in and push towards development of AVs stems from the demand for safer transportation. It is generally assumed that replacing the intrinsic imperfections of human drivers with expert computational models may significantly reduce the number of accidents caused by driver error [1]. Yet, development of reliable and robust AV technologies remains an ongoing challenge, and is actively pursued from various directions of research and development [2]. Of particular importance is the research on reliable motion planning and collision avoidance mechanisms. Over the span of multiple decades, numerous approaches to this problem have been proposed [3], ranging from control-theoretic formalizations and optimal control methods to potential field- and rule-based techniques. More recently, advances in machine learning have enabled new data-driven approaches to collision avoidance based on techniques such as imitation learning [4] and deep Reinforcement Learning (RL) [5]. However, with the growing complexity of their deployment settings and mechanisms, providing safety guarantees for these solutions is becoming increasingly difficult [2].
Investigating Human Priors for Playing Video Games
Dubey, Rachit, Agrawal, Pulkit, Pathak, Deepak, Griffiths, Thomas L., Efros, Alexei A.
What makes humans so good at solving seemingly complex video games? Unlike computers, humans bring in a great deal of prior knowledge about the world, enabling efficient decision making. This paper investigates the role of human priors for solving video games. Given a sample game, we conduct a series of ablation studies to quantify the importance of various priors on human performance. We do this by modifying the video game environment to systematically mask different types of visual information that could be used by humans as priors. We find that removal of some prior knowledge causes a drastic degradation in the speed with which human players solve the game, e.g. from 2 minutes to over 20 minutes. Furthermore, our results indicate that general priors, such as the importance of objects and visual consistency, are critical for efficient game-play.
Automatic Goal Generation for Reinforcement Learning Agents
Held, David, Geng, Xinyang, Florensa, Carlos, Abbeel, Pieter
Reinforcement learning is a powerful technique to train an agent to perform a task. However, an agent that is trained using reinforcement learning is only capable of achieving the single task that is specified via its reward function. Such an approach does not scale well to settings in which an agent needs to perform a diverse set of tasks, such as navigating to varying positions in a room or moving objects to varying locations. Instead, we propose a method that allows an agent to automatically discover the range of tasks that it is capable of performing. We use a generator network to propose tasks for the agent to try to achieve, specified as goal states. The generator network is optimized using adversarial training to produce tasks that are always at the appropriate level of difficulty for the agent. Our method thus automatically produces a curriculum of tasks for the agent to learn. We show that, by using this framework, an agent can efficiently and automatically learn to perform a wide set of tasks without requiring any prior knowledge of its environment. Our method can also learn to achieve tasks with sparse rewards, which traditionally pose significant challenges.
Mitigation of Policy Manipulation Attacks on Deep Q-Networks with Parameter-Space Noise
Behzadan, Vahid, Munir, Arslan
Recent developments have established the vulnerability of deep reinforcement learning to policy manipulation attacks via intentionally perturbed inputs, known as adversarial examples. In this work, we propose a technique for mitigation of such attacks based on addition of noise to the parameter space of deep reinforcement learners during training. We experimentally verify the effect of parameter-space noise in reducing the transferability of adversarial examples, and demonstrate the promising performance of this technique in mitigating the impact of whitebox and blackbox attacks at both test and training times.
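As a rough illustration of what "parameter-space noise" means, the sketch below perturbs the weights of a tiny linear Q-function with Gaussian noise before greedy action selection. It is a minimal sketch under stated assumptions, not the authors' implementation: the linear Q-function, the function names `perturb_parameters` and `greedy_action`, and the noise scale `sigma` are all hypothetical, and the abstract does not specify the noise distribution or how it is scheduled during training.

```python
import random

def perturb_parameters(weights, sigma, rng):
    """Return a copy of the weight matrix with i.i.d. Gaussian noise added.

    Assumption: zero-mean Gaussian noise with a fixed standard deviation
    `sigma`; the abstract does not state the distribution actually used.
    """
    return [[w + rng.gauss(0.0, sigma) for w in row] for row in weights]

def greedy_action(weights, state):
    """Pick the action with the largest Q-value under a linear Q-function.

    Each row of `weights` holds the coefficients for one action, so
    Q(state, a) is the dot product of row `a` with the state vector.
    """
    q_values = [sum(w * x for w, x in zip(row, state)) for row in weights]
    return max(range(len(q_values)), key=q_values.__getitem__)

# Hypothetical two-action, two-feature Q-function.
weights = [[1.0, 0.0], [0.0, 1.0]]
state = [0.2, 0.8]

rng = random.Random(0)
noisy_weights = perturb_parameters(weights, sigma=0.1, rng=rng)
action = greedy_action(noisy_weights, state)  # action chosen by the perturbed network
```

The noise is applied to the parameters rather than to the chosen action, so the perturbed policy stays consistent across states until the parameters are resampled.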
Playing Atari with Six Neurons
Cuccu, Giuseppe, Togelius, Julian, Cudre-Mauroux, Philippe
Deep reinforcement learning on Atari games maps pixels directly to actions; internally, the deep neural network bears the responsibility of both extracting useful information and making decisions based on it. Aiming at devoting entire deep networks to decision making alone, we propose a new method for learning policies and compact state representations separately but simultaneously for policy approximation in reinforcement learning. State representations are generated by a novel algorithm based on Vector Quantization and Sparse Coding, trained online along with the network, and capable of growing its dictionary size over time. We also introduce new techniques allowing both the neural network and the evolution strategy to cope with varying dimensions. This enables networks of only 6 to 18 neurons to learn to play a selection of Atari games with performance comparable, and occasionally superior, to state-of-the-art techniques using evolution strategies on deep networks two orders of magnitude larger.
Importance Sampling Policy Evaluation with an Estimated Behavior Policy
Hanna, Josiah, Niekum, Scott, Stone, Peter
In reinforcement learning, off-policy evaluation is the task of using data generated by one policy to determine the expected return of a second policy. Importance sampling is a standard technique for off-policy evaluation, allowing off-policy data to be used as if it were on-policy. When the policy that generated the off-policy data is unknown, the ordinary importance sampling estimator cannot be applied. In this paper, we study a family of regression importance sampling (RIS) methods that apply importance sampling by first estimating the behavior policy. We find that these estimators give strong empirical performance, surprisingly often outperforming importance sampling with the true behavior policy in both discrete and continuous domains. Our results emphasize the importance of estimating the behavior policy using only the data that will also be used for the importance sampling estimate.
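The difference between ordinary importance sampling and importance sampling with an estimated behavior policy can be sketched in a deliberately simplified setting. The example below assumes single-step episodes (a bandit) with discrete actions and uses a count-based estimate of the behavior policy; the function names, the toy data, and the count-based estimator are illustrative assumptions, not the paper's full trajectory-level RIS family.

```python
from collections import Counter

def ordinary_is(data, pi_e, pi_b):
    """Ordinary importance sampling estimate of the evaluation policy's value.

    `data` is a list of (action, reward) pairs from single-step episodes;
    `pi_e` and `pi_b` map each action to its probability.
    """
    return sum(pi_e[a] / pi_b[a] * r for a, r in data) / len(data)

def estimate_behavior_policy(data):
    """Count-based estimate of the behavior policy from the same data."""
    counts = Counter(a for a, _ in data)
    return {a: c / len(data) for a, c in counts.items()}

def regression_is(data, pi_e):
    """Importance sampling that plugs in the estimated behavior policy."""
    return ordinary_is(data, pi_e, estimate_behavior_policy(data))

# Toy data: action 0 always pays 1.0, action 1 always pays 0.0.
# The true behavior policy is uniform, but action 0 happened to be sampled 3 of 4 times.
data = [(0, 1.0), (0, 1.0), (0, 1.0), (1, 0.0)]
pi_b_true = {0: 0.5, 1: 0.5}
pi_e = {0: 0.9, 1: 0.1}  # true value of pi_e is 0.9

ois_estimate = ordinary_is(data, pi_e, pi_b_true)  # approximately 1.35
ris_estimate = regression_is(data, pi_e)           # approximately 0.9
```

In this toy case the estimated policy reflects the action frequencies that actually occurred, so the weights correct for sampling error in the data as well as for the policy mismatch. This is one intuition for why estimating the behavior policy from the same data can help.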
Relational inductive bias for physical construction in humans and machines
Hamrick, Jessica B., Allen, Kelsey R., Bapst, Victor, Zhu, Tina, McKee, Kevin R., Tenenbaum, Joshua B., Battaglia, Peter W.
While current deep learning systems excel at tasks such as object classification, language processing, and gameplay, few can construct or modify a complex system such as a tower of blocks. We hypothesize that what these systems lack is a "relational inductive bias": a capacity for reasoning about inter-object relations and making choices over a structured description of a scene. To test this hypothesis, we focus on a task that involves gluing pairs of blocks together to stabilize a tower, and quantify how well humans perform. We then introduce a deep reinforcement learning agent which uses object- and relation-centric scene and policy representations and apply it to the task. Our results show that these structured representations allow the agent to outperform both humans and more naive approaches, suggesting that relational inductive bias is an important component in solving structured reasoning problems and for building more intelligent, flexible machines.
Measuring and avoiding side effects using relative reachability
Krakovna, Victoria, Orseau, Laurent, Martic, Miljan, Legg, Shane
How can we design reinforcement learning agents that avoid causing unnecessary disruptions to their environment? We argue that current approaches to penalizing side effects can introduce bad incentives in tasks that require irreversible actions, and in environments that contain sources of change other than the agent. For example, some approaches give the agent an incentive to prevent any irreversible changes in the environment, including the actions of other agents. We introduce a general definition of side effects, based on relative reachability of states compared to a default state, that avoids these undesirable incentives. Using a set of gridworld experiments illustrating relevant scenarios, we empirically compare relative reachability to penalties based on existing definitions and show that it is the only penalty among those tested that produces the desired behavior in all the scenarios.
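One way to read "relative reachability of states compared to a default state" is as the fraction of states that are reachable from the default (baseline) state but no longer reachable from the current state. The sketch below implements that reading on a small hand-written transition graph. It is an assumed, undiscounted simplification with hypothetical names (`reachable_set`, `relative_reachability_penalty`), not the paper's exact definition.

```python
from collections import deque

def reachable_set(graph, start):
    """Return all states reachable from `start` via breadth-first search."""
    seen = {start}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for nxt in graph.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen

def relative_reachability_penalty(graph, current, baseline):
    """Fraction of states reachable from `baseline` but not from `current`.

    Only lost reachability is penalized; states that became newly
    reachable do not reduce the penalty.
    """
    states = set(graph) | {s for nbrs in graph.values() for s in nbrs}
    lost = reachable_set(graph, baseline) - reachable_set(graph, current)
    return len(lost) / len(states)

# Hypothetical two-state world: a vase can be broken but not repaired.
graph = {
    "intact": ["intact", "broken"],
    "broken": ["broken"],
}

penalty_after_breaking = relative_reachability_penalty(graph, "broken", "intact")  # 0.5
penalty_after_noop = relative_reachability_penalty(graph, "intact", "intact")      # 0.0
```

Because the penalty is measured against what the baseline state could still reach, changes that would have happened anyway do not count against the agent in this reading.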
TD or not TD: Analyzing the Role of Temporal Differencing in Deep Reinforcement Learning
Amiranashvili, Artemij, Dosovitskiy, Alexey, Koltun, Vladlen, Brox, Thomas
Our understanding of reinforcement learning (RL) has been shaped by theoretical and empirical results that were obtained decades ago using tabular representations and linear function approximators. These results suggest that RL methods that use temporal differencing (TD) are superior to direct Monte Carlo estimation (MC). How do these results hold up in deep RL, which deals with perceptually complex environments and deep nonlinear models? In this paper, we re-examine the role of TD in modern deep RL, using specially designed environments that control for specific factors that affect performance, such as reward sparsity, reward delay, and the perceptual complexity of the task. When comparing TD with infinite-horizon MC, we are able to reproduce classic results in modern settings. Yet we also find that finite-horizon MC is not inferior to TD, even when rewards are sparse or delayed. This makes MC a viable alternative to TD in deep RL.
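The two kinds of learning target contrasted here can be written down in a few lines. The sketch below uses the textbook one-step TD target and a truncated (finite-horizon) Monte Carlo return; the function names and the toy reward sequence are hypothetical, and the exact variants evaluated in the paper may differ.

```python
def td_target(reward, next_value, gamma):
    """One-step TD target: immediate reward plus the discounted bootstrapped value."""
    return reward + gamma * next_value

def finite_horizon_mc_return(rewards, t, horizon, gamma):
    """Discounted sum of the next `horizon` observed rewards starting at step t.

    No value estimate is bootstrapped: rewards beyond the horizon are ignored.
    """
    window = rewards[t:t + horizon]
    return sum(gamma ** k * r for k, r in enumerate(window))

# Toy episode with a single delayed reward at step 2.
rewards = [0.0, 0.0, 1.0, 0.0]
gamma = 0.5

mc_target = finite_horizon_mc_return(rewards, t=0, horizon=3, gamma=gamma)  # 0.25
# The TD target depends on a learned estimate of the next state's value;
# 0.5 is the correct value of the next state in this toy episode.
bootstrap_target = td_target(rewards[0], next_value=0.5, gamma=gamma)       # 0.25
```

The TD target is only as good as the value estimate it bootstraps from, whereas the finite-horizon MC target uses observed rewards directly and ignores everything past the horizon.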
Faster Deep Q-learning using Neural Episodic Control
Nishio, Daichi, Yamane, Satoshi
Research on deep reinforcement learning, which estimates Q-values using deep learning, has recently attracted the interest of researchers. In deep reinforcement learning, it is important to learn efficiently from the experiences that an agent has collected by exploring the environment. We propose NEC2DQN, which improves the learning speed of a sample-inefficient algorithm such as DQN by using a sample-efficient one such as NEC at the beginning of learning. We show that it learns faster than Double DQN or N-step DQN in experiments on Pong.