Reinforcement Learning
Learning When to Drive in Intersections by Combining Reinforcement Learning and Model Predictive Control
Tram, Tommy, Batkovic, Ivo, Ali, Mohammad, Sjöberg, Jonas
Learning When to Drive in Intersections by Combining Reinforcement Learning and Model Predictive Control Tommy Tram 1, 2, 3, Ivo Batkovic 1, 2, 3, Mohammad Ali 1, and Jonas Sj oberg 2 Abstract -- In this paper, we propose a decision making algorithm intended for automated vehicles that negotiate with other possibly non-automated vehicles in intersections. The decision algorithm is separated into two parts: a high-level decision module based on reinforcement learning, and a low-level planning module based on model predictive control. Traffic is simulated with numerous predefined driver behaviors and intentions, and the performance of the proposed decision algorithm was evaluated against another controller . The results show that the proposed decision algorithm yields shorter training episodes and an increased performance in success rate compared to the other controller . Interactions between road users in intersections is a complex problem to solve, making it difficult to address using conventional rule based systems. Many advancements aim to solve this problem by trying to imitate human drivers [1] or predicting what other drivers in traffic are planning to do [2]. In [3], the authors show that by modeling the decision process as a partially observable Markov decision process, the model can account for uncertainty in sensing the environment and [4] showed some probabilistic guarantees when solving the problem using reinforcement learning (RL).
Optimal Attacks on Reinforcement Learning Policies
Russo, Alessio, Proutiere, Alexandre
Control policies, trained using the Deep Reinforcement Learning, have been recently shown to be vulnerable to adversarial attacks introducing even very small perturbations to the policy input. The attacks proposed so far have been designed using heuristics, and build on existing adversarial example crafting techniques used to dupe classifiers in supervised learning. In contrast, this paper investigates the problem of devising optimal attacks, depending on a well-defined attacker's objective, e.g., to minimize the main agent average reward. When the policy and the system dynamics, as well as rewards, are known to the attacker, a scenario referred to as a white-box attack, designing optimal attacks amounts to solving a Markov Decision Process. For what we call black-box attacks, where neither the policy nor the system is known, optimal attacks can be trained using Reinforcement Learning techniques. Through numerical experiments, we demonstrate the efficiency of our attacks compared to existing attacks (usually based on Gradient methods). We further quantify the potential impact of attacks and establish its connection to the smoothness of the policy under attack. Smooth policies are naturally less prone to attacks (this explains why Lipschitz policies, with respect to the state, are more resilient). Finally, we show that from the main agent perspective, the system uncertainties and the attacker can be modeled as a Partially Observable Markov Decision Process. We actually demonstrate that using Reinforcement Learning techniques tailored to POMDP (e.g. using Recurrent Neural Networks) leads to more resilient policies.
Inverse Reinforcement Learning with Multiple Ranked Experts
Castro, Pablo Samuel, Li, Shijian, Zhang, Daqing
We consider the problem of learning to behave optimally in a Markov Decision Process when a reward function is not specified, but instead we have access to a set of demonstrators of varying performance. We assume the demonstrators are classified into one of k ranks, and use ideas from ordinal regression to find a reward function that maximizes the margin between the different ranks. This approach is based on the idea that agents should not only learn how to behave from experts, but also how not to behave from non-experts. We show there are MDPs where important differences in the reward function would be hidden from existing algorithms by the behaviour of the expert. Our method is particularly useful for problems where we have access to a large set of agent behaviours with varying degrees of expertise (such as through GPS or cellphones). We highlight the differences between our approach and existing methods using a simple grid domain and demonstrate its efficacy on determining passenger-finding strategies for taxi drivers, using a large dataset of GPS trajectories.
Multi-Agent Adversarial Inverse Reinforcement Learning
Yu, Lantao, Song, Jiaming, Ermon, Stefano
Reinforcement learning agents are prone to undesired behaviors due to reward mis-specification. Finding a set of reward functions to properly guide agent behaviors is particularly challenging in multi-agent scenarios. Inverse reinforcement learning provides a framework to automatically acquire suitable reward functions from expert demonstrations. Its extension to multi-agent settings, however, is difficult due to the more complex notions of rational behaviors. In this paper, we propose MA-AIRL, a new framework for multi-agent inverse reinforcement learning, which is effective and scalable for Markov games with high-dimensional state-action space and unknown dynamics. We derive our algorithm based on a new solution concept and maximum pseudolikelihood estimation within an adversarial reward learning framework. In the experiments, we demonstrate that MA-AIRL can recover reward functions that are highly correlated with ground truth ones, and significantly outperforms prior methods in terms of policy imitation.
DeepPlace: Learning to Place Applications in Multi-Tenant Clusters
Mitra, Subrata, Mondal, Shanka Subhra, Sheoran, Nikhil, Dhake, Neeraj, Nehra, Ravinder, Simha, Ramanuja
Large multi-tenant production clusters often have to handle a variety of jobs and applications with a variety of complex resource usage characteristics. It is non-trivial and non-optimal to manually create placement rules for scheduling that would decide which applications should co-locate. In this paper, we present DeepPlace, a scheduler that learns to exploits various temporal resource usage patterns of applications using Deep Reinforcement Learning (Deep RL) to reduce resource competition across jobs running in the same machine while at the same time optimizing for overall cluster utilization.
A Unified Bellman Optimality Principle Combining Reward Maximization and Empowerment
Leibfried, Felix, Pascual-Diaz, Sergio, Grau-Moya, Jordi
Empowerment is an information-theoretic method that can be used to intrinsically motivate learning agents. It attempts to maximize an agent's control over the environment by encouraging visiting states with a large number of reachable next states. Empowered learning has been shown to lead to complex behaviors, without requiring an explicit reward signal. In this paper, we investigate the use of empowerment in the presence of an extrinsic reward signal. We hypothesize that empowerment can guide reinforcement learning (RL) agents to find good early behavioral solutions by encouraging highly empowered states. We propose a unified Bellman optimality principle for empowered reward maximization. Our empowered reward maximization approach generalizes both Bellman's optimality principle as well as recent information-theoretical extensions to it. We prove uniqueness of the empowered values and show convergence to the optimal solution. We then apply this idea to develop off-policy actor-critic RL algorithms for high-dimensional continuous domains. We experimentally validate our methods in robotics domains (MuJoCo). Our methods demonstrate improved initial and competitive final performance compared to model-free state-of-the-art techniques.
Reinforcement Learning with Wasserstein Distance Regularisation, with Applications to Multipolicy Learning
Abdullah, Mohammed Amin, Pacchiano, Aldo, Draief, Moez
We describe an application of Wasserstein distance to Reinforcement Learning. The Wasserstein distance in question is between the distribution of mappings of trajectories of a policy into some metric space, and some other fixed distribution (which may, for example, come from another policy). Different policies induce different distributions, so given an underlying metric, the Wasserstein distance quantifies how different policies are. This can be used to learn multiple polices which are different in terms of such Wasserstein distances by using a Wasserstein regulariser. Changing the sign of the regularisation parameter, one can learn a policy for which its trajectory mapping distribution is attracted to a given fixed distribution.
MineRL: A Large-Scale Dataset of Minecraft Demonstrations
Guss, William H., Houghton, Brandon, Topin, Nicholay, Wang, Phillip, Codel, Cayden, Veloso, Manuela, Salakhutdinov, Ruslan
The sample inefficiency of standard deep reinforcement learning methods precludes their application to many real-world problems. Methods which leverage human demonstrations require fewer samples but have been researched less. As demonstrated in the computer vision and natural language processing communities, large-scale datasets have the capacity to facilitate research by serving as an experimental and benchmarking platform for new methods. However, existing datasets compatible with reinforcement learning simulators do not have sufficient scale, structure, and quality to enable the further development and evaluation of methods focused on using human examples. Therefore, we introduce a comprehensive, large-scale, simulator-paired dataset of human demonstrations: MineRL. The dataset consists of over 60 million automatically annotated state-action pairs across a variety of related tasks in Minecraft, a dynamic, 3D, open-world environment. We present a novel data collection scheme which allows for the ongoing introduction of new tasks and the gathering of complete state information suitable for a variety of methods. We demonstrate the hierarchality, diversity, and scale of the MineRL dataset. Further, we show the difficulty of the Minecraft domain along with the potential of MineRL in developing techniques to solve key research challenges within it.
Control of nonlinear, complex and black-boxed greenhouse system with reinforcement learning
Modern control theories such as systems engineering approaches try to solve nonlinear system problems by revelation of causal relationship or co - relationship among the components; most of those approaches focus on control of sophisticatedly modeled white - boxed systems. We suggest an application of actor - critic reinforceme nt learning approach to control a nonlinear, complex and black - boxed system. We demonstrated this approach on artificial green - house environment simulator all of whose control inputs have several side effects so human cannot figure out how to control this system easily. Our approach succeeded to maintain the circumstance at least 20 times longer than PID and Deep Q Learning.
Multi-agent Inverse Reinforcement Learning for Two-person Zero-sum Games
Lin, Xiaomin, Beling, Peter A., Cogill, Randy
The focus of this paper is a Bayesian framework for solving a class of problems termed multi-agent inverse reinforcement learning (MIRL). Compared to the well-known inverse reinforcement learning (IRL) problem, MIRL is formalized in the context of stochastic games, which generalize Markov decision processes to game theoretic scenarios. We establish a theoretical foundation for competitive two-agent zero-sum MIRL problems and propose a Bayesian solution approach in which the generative model is based on an assumption that the two agents follow a minimax bi-policy. Numerical results are presented comparing the Bayesian MIRL method with two existing methods in the context of an abstract soccer game. Investigation centers on relationships between the extent of prior information and the quality of learned rewards. Results suggest that covariance structure is more important than mean value in reward priors.