Reinforcement Learning
Benchmarking Batch Deep Reinforcement Learning Algorithms
Fujimoto, Scott, Conti, Edoardo, Ghavamzadeh, Mohammad, Pineau, Joelle
Widely-used deep reinforcement learning algorithms have been shown to fail in the batch setting--learning from a fixed data set without interaction with the environment. Following this result, there have been several papers showing reasonable performances under a variety of environments and batch settings. In this paper, we benchmark the performance of recent off-policy and batch reinforcement learning algorithms under unified settings on the Atari domain, with data generated by a single partially-trained behavioral policy. We find that under these conditions, many of these algorithms underperform DQN trained online with the same amount of data, as well as the partially-trained behavioral policy. To introduce a strong baseline, we adapt the Batch-Constrained Q-learning algorithm to a discrete-action setting, and show it outperforms all existing algorithms at this task.
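The abstract does not spell out how Batch-Constrained Q-learning is adapted to discrete actions. As a minimal sketch of one plausible reading, the agent could act greedily on its Q-values, but only among actions that a model of the behavioral policy considers likely enough. The function name, the relative-probability rule, and the default threshold below are illustrative assumptions, not details taken from the abstract.

```python
def constrained_greedy_action(q_values, behavior_probs, threshold=0.3):
    """Return the highest-value action among those the behavioral model deems likely.

    q_values:       list of Q(s, a) estimates, one per discrete action.
    behavior_probs: list of estimated behavioral-policy probabilities for the same actions.
    threshold:      hypothetical cutoff on probability relative to the most likely action.
    """
    max_prob = max(behavior_probs)
    # Keep only actions whose probability is within the cutoff of the most likely one.
    allowed = [a for a, p in enumerate(behavior_probs) if p / max_prob > threshold]
    # Act greedily with respect to Q, restricted to the allowed set.
    return max(allowed, key=lambda a: q_values[a])
```

With a threshold of 0 every action with nonzero probability is allowed and the rule reduces to ordinary greedy selection. As the threshold approaches 1, the rule approaches imitation of the most likely behavioral action.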
Best Deep Reinforcement Learning Research of 2019 So Far
The scale of Internet-connected systems has increased considerably, and these systems are being exposed to cyberattacks more than ever. The complexity and dynamics of cyberattacks require protecting mechanisms to be responsive, adaptive, and large-scale. Machine learning, or more specifically DRL, methods have been proposed widely to address these issues. By incorporating deep learning into traditional RL, DRL is highly capable of solving complex, dynamic, and especially high-dimensional cyber defense problems. This paper presents a survey of DRL approaches developed for cyber security.
Ready, Set, Algorithms! Teams Learn AI by Racing Cars
Anyone with an Amazon Web Services account can participate in the league. Teams or individuals can compete online in "virtual" races or in person at events world-wide. Teams build and train AI algorithms using Amazon SageMaker software, deploy them to self-driving cars measuring about 10 inches, then race them around a track of roughly 17 feet by 26 feet. "It's actually having practical applications," said James Rhodes, chief technology officer of investment research firm Morningstar. Thanks to the training, the company expects to have dozens of projects based on reinforcement learning and other machine-learning techniques in deployment by the end of 2020, he said.
Analyzing the Variance of Policy Gradient Estimators for the Linear-Quadratic Regulator
Preiss, James A., Arnold, Sébastien M. R., Wei, Chen-Yu, Kloft, Marius
We study the variance of the REINFORCE policy gradient estimator in environments with continuous state and action spaces, linear dynamics, quadratic cost, and Gaussian noise. These simple environments allow us to derive bounds on the estimator variance in terms of the environment and noise parameters. We compare the predictions of our bounds to the empirical variance in simulation experiments.
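To make the setting concrete, the following sketch estimates the empirical variance of a REINFORCE gradient sample on a scalar system with linear dynamics, quadratic cost, and Gaussian noise. It is an illustration under assumed parameters, not the authors' experimental setup: the scalar dynamics, the linear-Gaussian policy with gain `k`, and all default values are hypothetical.

```python
import random
import statistics


def reinforce_gradient_sample(k, horizon=10, a=0.9, b=1.0, q=1.0, r=0.1,
                              policy_std=0.5, noise_std=0.1, rng=random):
    """One REINFORCE estimate of d(expected cost)/dk from a single rollout.

    Assumed dynamics: x' = a*x + b*u + w, with w ~ N(0, noise_std^2).
    Assumed policy:   u ~ N(k*x, policy_std^2).
    Cost per step:    q*x^2 + r*u^2.
    """
    x = rng.gauss(0.0, 1.0)
    total_cost = 0.0
    score = 0.0  # running sum of d/dk log pi(u_t | x_t)
    for _ in range(horizon):
        u = rng.gauss(k * x, policy_std)
        score += (u - k * x) * x / policy_std ** 2
        total_cost += q * x * x + r * u * u
        x = a * x + b * u + rng.gauss(0.0, noise_std)
    # Score-function estimator: summed score times total trajectory cost.
    return score * total_cost


def empirical_variance(k, num_rollouts=2000, seed=0, **kwargs):
    """Sample variance of the single-rollout estimator across independent rollouts."""
    rng = random.Random(seed)
    samples = [reinforce_gradient_sample(k, rng=rng, **kwargs)
               for _ in range(num_rollouts)]
    return statistics.variance(samples)
```

Calling `empirical_variance` over a range of `horizon`, `policy_std`, or `noise_std` values gives the kind of empirical curve that analytical variance bounds can be compared against.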
Formal Language Constraints for Markov Decision Processes
Quint, Eleanor, Xu, Dong, Dogan, Haluk, Hakguder, Zeynep, Scott, Stephen, Dwyer, Matthew
In order to satisfy safety conditions, a reinforcement learned (RL) agent may be constrained from acting freely, e.g., to prevent trajectories that might cause unwanted behavior or physical damage in a robot. We propose a general framework for augmenting a Markov decision process (MDP) with constraints that are described in formal languages over sequences of MDP states and agent actions. Constraint enforcement is implemented by filtering the allowed action set or by applying potential-based reward shaping to implement hard and soft constraint enforcement, respectively. We instantiate this framework using deterministic finite automata to encode constraints and propose methods of augmenting MDP observations with the state of the constraint automaton for learning. We empirically evaluate these methods with a variety of constraints by training Deep Q-Networks in Atari games as well as Proximal Policy Optimization in MuJoCo environments. We experimentally find that our approaches are effective in significantly reducing or eliminating constraint violations with either minimal negative or, depending on the constraint, a clear positive impact on final performance.
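The two enforcement modes can be sketched with a small automaton. This is a simplified illustration, not the authors' implementation: the class and function names are hypothetical, the automaton reads only action tokens (the abstract allows constraints over states and actions), and the potential table is supplied by hand.

```python
class ConstraintDFA:
    """Deterministic finite automaton that tracks a constraint over action tokens."""

    def __init__(self, transitions, start, violating):
        self.transitions = transitions  # dict: (dfa_state, token) -> next dfa_state
        self.start = start
        self.violating = set(violating)  # DFA states that mean the constraint is broken
        self.state = start

    def peek(self, token):
        """Next DFA state if `token` were taken; unknown pairs leave the state unchanged."""
        return self.transitions.get((self.state, token), self.state)

    def step(self, token):
        """Advance the automaton; the new state can be appended to the observation."""
        self.state = self.peek(token)
        return self.state

    def allowed_actions(self, actions):
        """Hard enforcement: drop actions that would enter a violating state."""
        return [a for a in actions if self.peek(a) not in self.violating]


def shaped_reward(reward, potential, dfa_state, next_dfa_state, gamma=0.99):
    """Soft enforcement: potential-based shaping over DFA states."""
    return reward + gamma * potential[next_dfa_state] - potential[dfa_state]


# Hypothetical constraint: never take "left" twice in a row.
no_double_left = ConstraintDFA(
    transitions={(0, "left"): 1, (0, "right"): 0, (1, "left"): 2, (1, "right"): 0},
    start=0,
    violating={2},
)
```

After one `step("left")`, `allowed_actions(["left", "right"])` returns only `"right"`. With shaping instead of filtering, the agent may still choose `"left"` but is penalized through the lower potential of the states that lead toward a violation.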
Unsupervised Doodling and Painting with Improved SPIRAL
Mellor, John F. J., Park, Eunbyung, Ganin, Yaroslav, Babuschkin, Igor, Kulkarni, Tejas, Rosenbaum, Dan, Ballard, Andy, Weber, Theophane, Vinyals, Oriol, Eslami, S. M. Ali
We investigate using reinforcement learning agents as generative models of images (extending arXiv:1804.01118). A generative agent controls a simulated painting environment, and is trained with rewards provided by a discriminator network simultaneously trained to assess the realism of the agent's samples, either unconditional or reconstructions. Compared to prior work, we make a number of improvements to the architectures of the agents and discriminators that lead to intriguing and at times surprising results. We find that when sufficiently constrained, generative agents can learn to produce images with a degree of visual abstraction, despite having only ever seen real photographs (no human brush strokes). And given enough time with the painting environment, they can produce images with considerable realism. These results show that, under the right circumstances, some aspects of human drawing can emerge from simulated embodiment, without the need for external supervision, imitation or social cues. Finally, we note the framework's potential for use in creative applications.
Natural Language State Representation for Reinforcement Learning
Schwartz, Erez, Tennenholtz, Guy, Tessler, Chen, Mannor, Shie
Recent advances in Reinforcement Learning have highlighted the difficulties in learning within complex high-dimensional domains. We argue that one of the main reasons current approaches do not perform well is that the information is represented sub-optimally. A natural way to describe what we observe is through natural language. In this paper, we implement a natural language state representation to learn and complete tasks. Our experiments suggest that natural-language-based agents are more robust, converge faster, and perform better than vision-based agents, showing the benefit of using natural language representations for Reinforcement Learning.
Deep Reinforcement Learning for Single-Shot Diagnosis and Adaptation in Damaged Robots
Verma, Shresth, Nair, Haritha S., Agarwal, Gaurav, Dhar, Joydip, Shukla, Anupam
Robotics has proved to be an indispensable tool in many industrial as well as social applications, such as warehouse automation, manufacturing, disaster robotics, etc. In most of these scenarios, damage to the agent while accomplishing mission-critical tasks can result in failure. To enable robotic adaptation in such situations, the agent needs to adopt policies which are robust to a diverse set of damages and must do so with minimum computational complexity. We thus propose a damage aware control architecture which diagnoses the damage prior to gait selection while also incorporating domain randomization in the damage space for learning a robust policy. To implement damage awareness, we have used a Long Short Term Memory based supervised learning network which diagnoses the damage and predicts the type of damage. The main novelty of this approach is that only a single policy is trained to adapt against a wide variety of damages and the diagnosis is done in a single trial at the time of damage.
Task-Relevant Adversarial Imitation Learning
Zolna, Konrad, Reed, Scott, Novikov, Alexander, Colmenarejo, Sergio Gomez, Budden, David, Cabi, Serkan, Denil, Misha, de Freitas, Nando, Wang, Ziyu
We show that a critical problem in adversarial imitation from high-dimensional sensory data is the tendency of discriminator networks to distinguish agent and expert behaviour using task-irrelevant features beyond the control of the agent. We analyze this problem in detail and propose a solution as well as several baselines that outperform standard Generative Adversarial Imitation Learning (GAIL). Our proposed solution, Task-Relevant Adversarial Imitation Learning (TRAIL), uses a constrained optimization objective to overcome task-irrelevant features. Comprehensive experiments show that TRAIL can solve challenging manipulation tasks from pixels by imitating human operators, where other agents such as behaviour cloning (BC), standard GAIL, improved GAIL variants including our newly proposed baselines, and Deterministic Policy Gradients from Demonstrations (DPGfD) fail to find solutions, even when the other agents have access to task reward.
Stabilizing Off-Policy Reinforcement Learning with Conservative Policy Gradients
Tessler, Chen, Merlis, Nadav, Mannor, Shie
In recent years, advances in deep learning have enabled the application of reinforcement learning algorithms in complex domains. However, they lack the theoretical guarantees which are present in the tabular setting and suffer from many stability and reproducibility problems (Henderson et al., 2018). In this work, we suggest a simple approach for improving stability and providing probabilistic performance guarantees in off-policy actor-critic deep reinforcement learning regimes. Experiments on continuous action spaces, in the MuJoCo control suite, show that our proposed method reduces the variance of the process and improves the overall performance.