Reinforcement Learning
Stabilising Experience Replay for Deep Multi-Agent Reinforcement Learning
Foerster, Jakob, Nardelli, Nantas, Farquhar, Gregory, Afouras, Triantafyllos, Torr, Philip H. S., Kohli, Pushmeet, Whiteson, Shimon
Many real-world problems, such as network packet routing and urban traffic control, are naturally modeled as multi-agent reinforcement learning (RL) problems. However, existing multi-agent RL methods typically scale poorly in the problem size. Therefore, a key challenge is to translate the success of deep learning on single-agent RL to the multi-agent setting. A major stumbling block is that independent Q-learning, the most popular multi-agent RL method, introduces nonstationarity that makes it incompatible with the experience replay memory on which deep Q-learning relies. This paper proposes two methods that address this problem: 1) using a multi-agent variant of importance sampling to naturally decay obsolete data and 2) conditioning each agent's value function on a fingerprint that disambiguates the age of the data sampled from the replay memory. Results on a challenging decentralised variant of StarCraft unit micromanagement confirm that these methods enable the successful combination of experience replay with multi-agent RL.
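As a rough illustration of the second method, the sketch below stores a fingerprint with every transition so that a value function can tell old data from new. The abstract does not say what the fingerprint contains; this sketch assumes it is the training iteration and exploration rate at collection time, and the class name `FingerprintReplayBuffer` and its methods are hypothetical.

```python
import random
from collections import deque


class FingerprintReplayBuffer:
    """Replay memory whose observations carry a fingerprint of their age.

    Assumption for this sketch: the fingerprint is the pair
    (training iteration, exploration rate) at the time of collection.
    """

    def __init__(self, capacity, seed=0):
        self.buffer = deque(maxlen=capacity)  # oldest transitions are evicted first
        self.rng = random.Random(seed)

    def __len__(self):
        return len(self.buffer)

    def add(self, obs, action, reward, next_obs, iteration, epsilon):
        fingerprint = (float(iteration), float(epsilon))
        # Append the fingerprint to both observations so the value function
        # is conditioned on when the transition was generated.
        self.buffer.append(
            (tuple(obs) + fingerprint, action, reward, tuple(next_obs) + fingerprint)
        )

    def sample(self, batch_size):
        # Uniform sampling without replacement from the stored transitions.
        return self.rng.sample(list(self.buffer), batch_size)
```

Each agent would then feed the augmented observation to its own Q-network, so that two transitions with identical raw observations but different ages are no longer indistinguishable.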
Artificial Intelligence: Reinforcement Learning in Python
When people talk about artificial intelligence, they usually don't mean supervised and unsupervised machine learning. These tasks are pretty trivial compared to what we think of AIs doing - playing chess and Go, driving cars, and beating video games at a superhuman level. Reinforcement learning has recently become popular for doing all of that and more. Much like deep learning, a lot of the theory was discovered in the 70s and 80s, but only recently have we been able to observe first hand the amazing results that are possible. In 2016 we saw Google's AlphaGo beat the world champion in Go.
Meta learning Framework for Automated Driving
Sallab, Ahmad El, Saeed, Mahmoud, Tawab, Omar Abdel, Abdou, Mohammed
The success of automated driving deployment is highly dependent on the ability to develop an efficient and safe driving policy. The problem is well formulated under the framework of optimal control as a cost optimization problem. Model-based solutions using traditional planning are efficient, but require knowledge of the environment model. On the other hand, model-free solutions suffer from sample inefficiency and require too many interactions with the environment, which is infeasible in practice. Methods under the Reinforcement Learning framework usually require the notion of a reward function, which is not available in the real world. Imitation learning helps improve sample efficiency by introducing prior knowledge obtained from the demonstrated behavior, at the risk of exact behavior cloning without generalizing to unseen environments. In this paper we propose a Meta learning framework, based on data set aggregation, to improve generalization of imitation learning algorithms. Under the proposed framework, we propose MetaDAgger, a novel algorithm which tackles the generalization issues in traditional imitation learning. We use The Open Race Car Simulator (TORCS) to test our algorithm. Results on unseen test tracks show significant improvement over traditional imitation learning algorithms, improving the learning time and sample efficiency at the same time. The results are also supported by visualization of the learnt features to demonstrate generalization of the captured details.
Symmetry Learning for Function Approximation in Reinforcement Learning
Mahajan, Anuj, Tulabandhula, Theja
Reinforcement Learning (RL) is the task of training an agent to perform optimally in an environment using the reward and observation signals perceived upon taking actions which change the environment dynamics. Learning optimal behavior is inherently difficult because of challenges like credit assignment and exploration-exploitation trade-offs that need to be made while converging to a solution. In many scenarios, like training a rover to move on a Martian surface, the cost of obtaining samples for learning can be high (in terms of the robot's energy expenditure etc.), and so sample efficiency is an important subproblem which deserves special attention. Very often the environment has intrinsic symmetries which can be leveraged by the agent to improve performance and learn more efficiently. For example, in the Cart-Pole domain [1, 2] the state-action space is symmetric with respect to reflection about the plane perpendicular to the direction of motion of the cart (Figure 1). In fact, in many environments, the number of symmetry relations tends to increase with the dimensionality of the state space. For instance, for the simple case of grid world of dimension d (Figure 1) there exist O(d!2
Stochastic Variance Reduction Methods for Policy Evaluation
Du, Simon S., Chen, Jianshu, Li, Lihong, Xiao, Lin, Zhou, Dengyong
Policy evaluation is a crucial step in many reinforcement-learning procedures, which estimates a value function that predicts states' long-term value under a given policy. In this paper, we focus on policy evaluation with linear function approximation over a fixed dataset. We first transform the empirical policy evaluation problem into a (quadratic) convex-concave saddle point problem, and then present a primal-dual batch gradient method, as well as two stochastic variance reduction methods for solving the problem. These algorithms scale linearly in both sample size and feature dimension. Moreover, they achieve linear convergence even when the saddle-point problem has only strong concavity in the dual variables but no strong convexity in the primal variables. Numerical experiments on benchmark problems demonstrate the effectiveness of our methods.
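For orientation, the transformation mentioned in the abstract can be written out as follows. The notation is assumed for this sketch and does not appear in the abstract: $\phi_t$ and $\phi'_t$ are the feature vectors of the current and next state, $r_t$ the reward, $\gamma$ the discount factor, $\theta$ the primal weights, $w$ the dual variable, and $n$ the dataset size; any regularisation term is omitted.

```latex
% Empirical statistics over a fixed dataset of n transitions
\hat{A} = \frac{1}{n}\sum_{t=1}^{n} \phi_t (\phi_t - \gamma \phi'_t)^{\top}, \qquad
\hat{b} = \frac{1}{n}\sum_{t=1}^{n} r_t \phi_t, \qquad
\hat{C} = \frac{1}{n}\sum_{t=1}^{n} \phi_t \phi_t^{\top}

% Quadratic objective (empirical projected Bellman error)
\min_{\theta} \; \tfrac{1}{2} \, (\hat{b} - \hat{A}\theta)^{\top} \hat{C}^{-1} (\hat{b} - \hat{A}\theta)

% Equivalent convex-concave saddle-point problem
\min_{\theta} \max_{w} \; L(\theta, w) = w^{\top}(\hat{b} - \hat{A}\theta) - \tfrac{1}{2} \, w^{\top} \hat{C} w

% One primal-dual batch gradient step with step sizes \sigma_\theta, \sigma_w
\theta \leftarrow \theta + \sigma_\theta \, \hat{A}^{\top} w, \qquad
w \leftarrow w + \sigma_w \, (\hat{b} - \hat{A}\theta - \hat{C} w)
```

Maximising over $w$ in closed form gives $w^{*} = \hat{C}^{-1}(\hat{b} - \hat{A}\theta)$, which recovers the quadratic objective. Note that $L$ is strongly concave in $w$ when $\hat{C}$ is positive definite but only linear in $\theta$, which matches the setting described in the abstract.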
aleju/self-driving-truck
This repository contains code to train and run a self-driving truck in Euro Truck Simulator 2. The resulting AI will automatically steer, accelerate and brake. It is trained (mostly) via reinforcement learning and only has access to the buttons W, A, S and D (i.e. it cannot directly set the steering wheel angle). The basic training method follows the standard reinforcement learning approach from the original Atari paper. Additionally, a separation of Q-values into V (value) and A (advantage) - as described in Dueling Network Architectures for Deep Reinforcement Learning - is used. Further, the model tries to predict future states and rewards, similar to the description in Deep Successor Reinforcement Learning.
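The V/A separation can be illustrated with a minimal sketch of the mean-subtracted aggregation from the dueling architecture paper. Whether this repository uses exactly this aggregator is an assumption; the function name `dueling_q_values` and the example numbers are hypothetical.

```python
def dueling_q_values(value, advantages):
    """Combine a state value V(s) and per-action advantages A(s, a) into Q(s, a).

    Subtracting the mean advantage makes the decomposition identifiable:
    the resulting Q-values always average to V(s).
    """
    mean_advantage = sum(advantages) / len(advantages)
    return [value + a - mean_advantage for a in advantages]


# Four actions standing in for the W, A, S and D keys (illustrative numbers only).
q_values = dueling_q_values(1.0, [0.5, -0.5, 2.0, 0.0])
best_action = max(range(len(q_values)), key=lambda i: q_values[i])
```

In a network, `value` and `advantages` would come from two separate output heads on top of shared layers; the greedy action is unchanged by the mean subtraction because it shifts all Q-values equally.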
Efficient Reinforcement Learning via Initial Pure Exploration
Putta, Sudeep Raja, Tulabandhula, Theja
In several realistic situations, an interactive learning agent can practice and refine its strategy before going on to be evaluated. For instance, consider a student preparing for a series of tests. She would typically take a few practice tests to know which areas she needs to improve upon. Based on the scores she obtains in these practice tests, she would formulate a strategy for maximizing her scores in the actual tests. We treat this scenario in the context of an agent exploring a fixed-horizon episodic Markov Decision Process (MDP), where the agent can practice on the MDP for some number of episodes (not necessarily known in advance) before starting to incur regret for its actions. During practice, the agent's goal must be to maximize the probability of following an optimal policy. This is akin to the problem of Pure Exploration (PE). We extend the PE problem of Multi Armed Bandits (MAB) to MDPs and propose a Bayesian algorithm called Posterior Sampling for Pure Exploration (PSPE), which is similar to its bandit counterpart. We show that the Bayesian simple regret converges at an optimal exponential rate when using PSPE. When the agent starts being evaluated, its goal would be to minimize the cumulative regret incurred. This is akin to the problem of Reinforcement Learning (RL). The agent uses the Posterior Sampling for Reinforcement Learning algorithm (PSRL) initialized with the posteriors of the practice phase. We hypothesize that this PSPE + PSRL combination is an optimal strategy for minimizing regret in RL problems with an initial practice phase. We show empirical results which prove that having a lower simple regret at the end of the practice phase results in having lower cumulative regret during evaluation.
Project Magenta: Music and Art with Machine Learning (Google I/O '17)
Google Brain researcher Douglas Eck will discuss Magenta, a project using TensorFlow to generate art and music with deep nets and reinforcement learning. He'll also talk about how artists and musicians fit into the effort. We'll dive into some of the technical details and challenges faced in building generative models, but no machine learning expertise is required to follow the session. See all the talks from Google I/O '17 here: https://goo.gl/D0D4VE
Intrinsically motivated model learning for developing curious robots
Reinforcement Learning (RL) agents are typically deployed to learn a specific, concrete task based on a pre-defined reward function. However, in some cases an agent may be able to gain experience in the domain prior to being given a task. In such cases, intrinsic motivation can be used to enable the agent to learn a useful model of the environment that is likely to help it learn its eventual tasks more efficiently. This paradigm fits robots particularly well, as they need to learn about their own dynamics and affordances which can be applied to many different tasks. The algorithm learns models of the transition dynamics of a domain using random forests.
[R] [1705.06366] Automatic Goal Generation for Reinforcement Learning Agents • r/MachineLearning
I wanted to like this paper. Curriculum learning is an area that needs more research, and automating the curriculum process is a good idea. However, the use of a GAN here is completely overkill -- the space of goals here is low dimensional and relatively unstructured. I'd wager any generative modeling technique would've worked for their experiments.