Reinforcement Learning
Shared autonomy via deep reinforcement learning
Imagine a drone pilot remotely flying a quadrotor, using an onboard camera to navigate and land. Unfamiliar flight dynamics, terrain, and network latency can make this system challenging for a human to control. One approach to this problem is to train an autonomous agent to perform tasks like patrolling and mapping without human intervention. This strategy works well when the task is clearly specified and the agent can observe all the information it needs to succeed. Unfortunately, many real-world applications that involve human users do not satisfy these conditions: the user's intent is often private information that the agent cannot directly access, and the task may be too complicated for the user to precisely define.
Lessons Learned Reproducing a Deep Reinforcement Learning Paper
There are a lot of neat things going on in deep reinforcement learning. One of the coolest things from last year was OpenAI and DeepMind's work on training an agent using feedback from a human rather than a classical reward signal. There's a great blog post about it at Learning from Human Preferences, and the original paper is at Deep Reinforcement Learning from Human Preferences. I've seen a few recommendations that reproducing papers is a good way of levelling up machine learning skills, and I decided this could be an interesting one to try with. It was indeed a super fun project, and I'm happy to have tackled it - but looking back, I realise it wasn't exactly the experience I thought it would be. If you're thinking about reproducing papers too, here are some notes on what surprised me about working with deep RL.
No Metrics Are Perfect: Adversarial Reward Learning for Visual Storytelling
Wang, Xin, Chen, Wenhu, Wang, Yuan-Fang, Wang, William Yang
Though impressive results have been achieved in visual captioning, the task of generating abstract stories from photo streams is still a little-tapped problem. Different from captions, stories have more expressive language styles and contain many imaginary concepts that do not appear in the images. Thus it poses challenges to behavioral cloning algorithms. Furthermore, due to the limitations of automatic metrics on evaluating story quality, reinforcement learning methods with hand-crafted rewards also face difficulties in gaining an overall performance boost. Therefore, we propose an Adversarial REward Learning (AREL) framework to learn an implicit reward function from human demonstrations, and then optimize policy search with the learned reward function. Though automatic evaluation indicates slight performance boost over state-of-the-art (SOTA) methods in cloning expert behaviors, human evaluation shows that our approach achieves significant improvement in generating more human-like stories than SOTA systems.
An introduction to Reinforcement Learning โ freeCodeCamp
Reinforcement learning is an important type of Machine Learning where an agent learn how to behave in a environment by performing actions and seeing the results. In recent years, we've seen a lot of improvements in this fascinating area of research. Examples include DeepMind and the Deep Q learning architecture in 2014, beating the champion of the game of Go with AlphaGo in 2016, OpenAI and the PPO in 2017, amongst others. In this series of articles, we will focus on learning the different architectures used today to solve Reinforcement Learning problems. These will include Q -learning, Deep Q-learning, Policy Gradients, Actor Critic, and PPO.
Distributed Distributional Deterministic Policy Gradients
Barth-Maron, Gabriel, Hoffman, Matthew W., Budden, David, Dabney, Will, Horgan, Dan, TB, Dhruva, Muldal, Alistair, Heess, Nicolas, Lillicrap, Timothy
This work adopts the very successful distributional perspective on reinforcement learning and adapts it to the continuous control setting. We combine this within a distributed framework for off-policy learning in order to develop what we call the Distributed Distributional Deep Deterministic Policy Gradient algorithm, D4PG. We also combine this technique with a number of additional, simple improvements such as the use of $N$-step returns and prioritized experience replay. Experimentally we examine the contribution of each of these individual components, and show how they interact, as well as their combined contributions. Our results show that across a wide variety of simple control tasks, difficult manipulation tasks, and a set of hard obstacle-based locomotion tasks the D4PG algorithm achieves state of the art performance.
Gotta Learn Fast: A New Benchmark for Generalization in RL
Nichol, Alex, Pfau, Vicki, Hesse, Christopher, Klimov, Oleg, Schulman, John
In this report, we present a new reinforcement learning (RL) benchmark based on the Sonic the Hedgehog (TM) video game franchise. This benchmark is intended to measure the performance of transfer learning and few-shot learning algorithms in the RL domain. We also present and evaluate some baseline algorithms on the new benchmark.
State Distribution-aware Sampling for Deep Q-learning
Li, Weichao, Huang, Fuxian, Li, Xi, Pan, Gang, Wu, Fei
A critical and challenging problem in reinforcement learning is how to learn the state-action value function from the experience replay buffer and simultaneously keep sample efficiency and faster convergence to a high quality solution. In prior works, transitions are uniformly sampled at random from the replay buffer or sampled based on their priority measured by temporal-difference (TD) error. However, these approaches do not fully take into consideration the intrinsic characteristics of transition distribution in the state space and could result in redundant and unnecessary TD updates, slowing down the convergence of the learning procedure. To overcome this problem, we propose a novel state distribution-aware sampling method to balance the replay times for transitions with skew distribution, which takes into account both the occurrence frequencies of transitions and the uncertainty of state-action values. Consequently, our approach could reduce the unnecessary TD updates and increase the TD updates for state-action value with more uncertainty, making the experience replay more effective and efficient. Extensive experiments are conducted on both classic control tasks and Atari 2600 games based on OpenAI gym platform and the experimental results demonstrate the effectiveness of our approach in comparison with the standard DQN approach.
Benchmarking projective simulation in navigation problems
Melnikov, Alexey A., Makmal, Adi, Briegel, Hans J.
Projective simulation (PS) is a model for intelligent agents with a deliberation capacity that is based on episodic memory. The model has been shown to provide a flexible framework for constructing reinforcement-learning agents, and it allows for quantum mechanical generalization, which leads to a speed-up in deliberation time. PS agents have been applied successfully in the context of complex skill learning in robotics, and in the design of state-of-the-art quantum experiments. In this paper, we study the performance of projective simulation in two benchmarking problems in navigation, namely the grid world and the mountain car problem. The performance of PS is compared to standard tabular reinforcement learning approaches, Q-learning and SARSA. Our comparison demonstrates that the performance of PS and standard learning approaches are qualitatively and quantitatively similar, while it is much easier to choose optimal model parameters in case of projective simulation, with a reduced computational effort of one to two orders of magnitude. Our results show that the projective simulation model stands out for its simplicity in terms of the number of model parameters, which makes it simple to set up the learning agent in unknown task environments.
Towards Symbolic Reinforcement Learning with Common Sense
Garcez, Artur d'Avila, Dutra, Aimore Resende Riquetti, Alonso, Eduardo
Deep Reinforcement Learning (deep RL) has made several breakthroughs in recent years in applications ranging from complex control tasks in unmanned vehicles to game playing. Despite their success, deep RL still lacks several important capacities of human intelligence, such as transfer learning, abstraction and interpretability. Deep Symbolic Reinforcement Learning (DSRL) seeks to incorporate such capacities to deep Q-networks (DQN) by learning a relevant symbolic representation prior to using Q-learning. In this paper, we propose a novel extension of DSRL, which we call Symbolic Reinforcement Learning with Common Sense (SRL CS), offering a better balance between generalization and specialization, inspired by principles of common sense when assigning rewards and aggregating Q-values. Experiments reported in this paper show that SRL CS learns consistently faster than Q-learning and DSRL, achieving also a higher accuracy. In the hardest case, where agents were trained in a deterministic environment and tested in a random environment, SRL CS achieves nearly 100% average accuracy compared to DSRL's 70% and DQN's 50% accuracy. To the best of our knowledge, this is the first case of near perfect zero-shot transfer learning using Reinforcement Learning.
Formulation of Deep Reinforcement Learning Architecture Toward Autonomous Driving for On-Ramp Merge
Multiple automakers have in development or in production automated driving systems (ADS) that offer freeway-pilot functions. This type of ADS is typically limited to restricted-access freeways only, that is, the transition from manual to automated modes takes place only after the ramp merging process is completed manually. One major challenge to extend the automation to ramp merging is that the automated vehicle needs to incorporate and optimize long-term objectives (e.g. successful and smooth merge) when near-term actions must be safely executed. Moreover, the merging process involves interactions with other vehicles whose behaviors are sometimes hard to predict but may influence the merging vehicle optimal actions. To tackle such a complicated control problem, we propose to apply Deep Reinforcement Learning (DRL) techniques for finding an optimal driving policy by maximizing the long-term reward in an interactive environment. Specifically, we apply a Long Short-Term Memory (LSTM) architecture to model the interactive environment, from which an internal state containing historical driving information is conveyed to a Deep Q-Network (DQN). The DQN is used to approximate the Q-function, which takes the internal state as input and generates Q-values as output for action selection. With this DRL architecture, the historical impact of interactive environment on the long-term reward can be captured and taken into account for deciding the optimal control policy. The proposed architecture has the potential to be extended and applied to other autonomous driving scenarios such as driving through a complex intersection or changing lanes under varying traffic flow conditions.