Reinforcement Learning
lexfridman/deeptraffic
DeepTraffic is a deep reinforcement learning competition hosted as part of the MIT Deep Learning courses. The goal is to create a neural network that drives a vehicle (or multiple vehicles) as fast as possible through dense highway traffic. Top 10 submissions are listed on the leaderboard and you'll be able to visualize your submission in the following way: To get started right away, this repository provides a code snippet to insert into the code box on the DeepTraffic site. We'll add additional agents as the course progresses: A basic network that achieves 66.8mph. And now let's return to the problem of traffic: "Americans will put up with anything provided it doesn't block traffic." - Dan Rather In the U.S. alone, we spend 6.9 billion hours sitting in traffic each year [1] -- roughly 10,000 human lifetimes [2]. Autonomous vehicles will be able to alleviate part (but not all) of the problem.
Information Theoretic Model Predictive Q-Learning
Bhardwaj, Mohak, Handa, Ankur, Fox, Dieter, Boots, Byron
Model-free Reinforcement Learning (RL) algorithms work well in sequential decision-making problems when experience can be collected cheaply and model-based RL is effective when system dynamics can be modeled accurately. However, both of these assumptions can be violated in real world problems such as robotics, where querying the system can be prohibitively expensive and real-world dynamics can be difficult to model accurately. Although sim-to-real approaches such as domain randomization attempt to mitigate the effects of biased simulation, they can still suffer from optimization challenges such as local minima and hand-designed distributions for randomization, making it difficult to learn an accurate global value function or policy that directly transfers to the real world. In contrast to RL, Model Predictive Control (MPC) algorithms use a simulator to optimize a simple policy class online, constructing a closed-loop controller that can effectively contend with real-world dynamics. MPC performance is usually limited by factors such as model bias and the limited horizon of optimization. In this work, we present a novel theoretical connection between information theoretic MPC and entropy regularized RL and develop a Q-learning algorithm that can leverage biased models. We validate the proposed algorithm on sim-to-sim control tasks to demonstrate the improvements over optimal control and reinforcement learning from scratch. Our approach paves the way for deploying reinforcement learning algorithms on real-robots in a systematic manner.
World Programs for Model-Based Learning and Planning in Compositional State and Action Spaces
Some of the most important tasks take place in environments which lack cheap and perfect simulators, thus hampering the application of model-free reinforcement learning (RL). While model-based RL aims to learn a dynamics model, in a more general case the learner does not know a priori what the action space is. Here we propose a formalism where the learner induces a world program by learning a dynamics model and the actions in graph-based compositional environments by observing state-state transition examples. Then, the learner can perform RL with the world program as the simulator for complex planning tasks. We highlight a recent application, and propose a challenge for the community to assess world program-based planning.
A New Framework for Query Efficient Active Imitation Learning
We seek to align agent policy with human expert behavior in a reinforcement learning (RL) setting, without any prior knowledge about dynamics, reward function, and unsafe states. There is a human expert knowing the rewards and unsafe states based on his preference and objective, but querying that human expert is expensive. To address this challenge, we propose a new framework for imitation learning (IL) algorithm that actively and interactively learns a model of the user's reward function with efficient queries. We build an adversarial generative model of states and a successor feature (SR) model trained over transition experience collected by learning policy. Our method uses these models to select state-action pairs, asking the user to comment on the optimality or safety, and trains a adversarial neural network to predict the rewards. Different from previous papers, which are almost all based on uncertainty sampling, the key idea is to actively and efficiently select state-action pairs from both on-policy and off-policy experience, by discriminating the queried (expert) and unqueried (generated) data and maximizing the efficiency of value function learning. We call this method adversarial reward query with successor representation. We evaluate the proposed method with simulated human on a state-based 2D navigation task, robotic control tasks and the image-based video games, which have high-dimensional observation and complex state dynamics. The results show that the proposed method significantly outperforms uncertainty-based methods on learning reward models, achieving better query efficiency, where the adversarial discriminator can make the agent learn human behavior more efficiently and the SR can select states which have stronger impact on value function. Moreover, the proposed method can also learn to avoid unsafe states when training the reward model.
Computational model discovery with reinforcement learning
Bassenne, Maxime, Lozano-Durán, Adrián
The motivation of this study is to leverage recent breakthroughs in artificial intelligence research to unlock novel solutions to important scientific problems encountered in computational science. To address the human intelligence limitations in discovering reduced-order models, we propose to supplement human thinking with artificial intelligence. Our three-pronged strategy consists of learning (i) models expressed in analytical form, (ii) which are evaluated a posteriori, and iii) using exclusively integral quantities from the reference solution as prior knowledge. In point (i), we pursue interpretable models expressed symbolically as opposed to black-box neural networks, the latter only being used during learning to efficiently parameterize the large search space of possible models. In point (ii), learned models are dynamically evaluated a posteriori in the computational solver instead of based on a priori information from preprocessed high-fidelity data, thereby accounting for the specificity of the solver at hand such as its numerics. Finally in point (iii), the exploration of new models is solely guided by predefined integral quantities, e.g., averaged quantities of engineering interest in Reynolds-averaged or large-eddy simulations (LES). We use a coupled deep reinforcement learning framework and computational solver to concurrently achieve these objectives. The combination of reinforcement learning with objectives (i), (ii) and (iii) differentiate our work from previous modeling attempts based on machine learning. In this report, we provide a high-level description of the model discovery framework with reinforcement learning. The method is detailed for the application of discovering missing terms in differential equations. An elementary instantiation of the method is described that discovers missing terms in the Burgers' equation.
Augmented Replay Memory in Reinforcement Learning With Continuous Control
Ramicic, Mirza, Bonarini, Andrea
Online reinforcement learning agents are currently able to process an increasing amount of data by converting it into a higher order value functions. This expansion of the information collected from the environment increases the agent's state space enabling it to scale up to a more complex problems but also increases the risk of forgetting by learning on redundant or conflicting data. To improve the approximation of a large amount of data, a random mini-batch of the past experiences that are stored in the replay memory buffer is often replayed at each learning step. The proposed work takes inspiration from a biological mechanism which act as a protective layer of human brain higher cognitive functions: active memory consolidation mitigates the effect of forgetting of previous memories by dynamically processing the new ones. The similar dynamics are implemented by a proposed augmented memory replay AMR capable of optimizing the replay of the experiences from the agent's memory structure by altering or augmenting their relevance. Experimental results show that an evolved AMR augmentation function capable of increasing the significance of the specific memories is able to further increase the stability and convergence speed of the learning algorithms dealing with the complexity of continuous action domains.
Individual specialization in multi-task environments with multiagent reinforcement learners
Gasparrini, Marco Jerome, Solé, Ricard, Sánchez-Fibla, Martí
There is a growing interest in Multi-Agent Reinforcement Learning (MARL) as the first steps towards building general intelligent agents that learn to make low and high-level decisions in non-stationary complex environments in the presence of other agents. Previous results point us towards increased conditions for coordination, efficiency/fairness, and common-pool resource sharing. We further study coordination in multi-task environments where several rewarding tasks can be performed and thus agents don't necessarily need to perform well in all tasks, but under certain conditions may specialize. An observation derived from the study is that epsilon greedy exploration of value-based reinforcement learning methods is not adequate for multi-agent independent learners because the epsilon parameter that controls the probability of selecting a random action synchronizes the agents artificially and forces them to have deterministic policies at the same time. By using policy-based methods with independent entropy regularised exploration updates, we achieved a better and smoother convergence. Another result that needs to be further investigated is that with an increased number of agents specialization tends to be more probable.
Loss aversion fosters coordination among independent reinforcement learners
Gasparrini, Marco Jerome, Sánchez-Fibla, Martí
L OSS AVERSION FOSTERS COORDINATION AMONG INDEPENDENT REINFORCEMENT LEARNERS ARX IV VERSION Marco Jerome Gasparrini University Pompeu Fabra Barcelona, Spain Martí Sánchez-Fibla † University Pompeu Fabra Barcelona, Spain January 1, 2020 A BSTRACT We study what are the factors that can accelerate the emergence of collaborative behaviours among independent selfish learning agents. We depart from the "Battle of the Exes" (BoE), a spatial repeated game from which human behavioral data has been obtained (by Hawkings and Goldstone, 2016) that we find interesting because it considers two cases: a classic game theory version, called ballistic, in which agents can only make one action/decision (equivalent to the Battle of the Sexes) and a spatial version, called dynamic, in which agents can change decision (a spatial continuous version). We model both versions of the game with independent reinforcement learning agents and we manipulate the reward function transforming it into an utility introducing "loss aversion": the reward that an agent obtains can be perceived as less valuable when compared to what the other got. We prove experimentally the introduction of loss aversion fosters cooperation by accelerating its appearance, and by making it possible in some cases like in the dynamic condition. We suggest that this may be an important factor explaining the rapid converge of human behaviour towards collaboration reported in the experiment of Hawkings and Goldstone.
Real-time Policy Distillation in Deep Reinforcement Learning
Policy distillation in deep reinforcement learning provides an effective way to transfer control policies from a larger network to a smaller untrained network without a significant degradation in performance. However, policy distillation is underexplored in deep reinforcement learning, and existing approaches are computationally inefficient, resulting in a long distillation time. In addition, the effectiveness of the distillation process is still limited to the model capacity. We propose a new distillation mechanism, called real-time policy distillation, in which training the teacher model and distilling the policy to the student model occur simultaneously. Accordingly, the teacher's latest policy is transferred to the student model in real time. This reduces the distillation time to half the original time or even less and also makes it possible for extremely small student models to learn skills at the expert level. We evaluated the proposed algorithm in the Atari 2600 domain. The results show that our approach can achieve full distillation in most games, even with compression ratios up to 1.7%.
Speeding up reinforcement learning by combining attention and agency features
Demirel, Berkay, Sánchez-Fibla, Martí
When playing video-games we immediately detect which entity we control and we center the attention towards it to focus the learning and reduce its dimensionality. Reinforcement Learning (RL) has been able to deal with big state spaces, including states derived from pixel images in Atari games, but the learning is slow, depends on the brute force mapping from the global state to the action values (Q-function), thus its performance is severely affected by the dimensionality of the state and cannot be transferred to other games or other parts of the same game. We propose different transformations of the input state that combine attention and agency detection mechanisms which both have been addressed separately in RL but not together to our knowledge. We propose and benchmark different architectures including both global and local agency centered versions of the state and also including summaries of the surroundings. Results suggest that even a redundant global-local state network can learn faster than the global alone. Summarized versions of the state look promising to achieve input-size independence learning.