Reinforcement Learning
DeepMind says it's given AI an imagination. Let's take a closer look at that
Google's AI boutique, DeepMind, known for dispelling human delusions of intellectual superiority by soundly beating the world's top Go players with computer code, has found that instilling its software agents with something like imagination helps them learn better. In two papers published this week, "Imagination-Augmented Agents for Deep Reinforcement Learning" and "Learning model-based planning from scratch", the AI biz's brain boffins, based in Britain, describe novel techniques for improving deep reinforcement learning through what can generously be described as imaginative planning. Reinforcement learning is a form of machine learning. It involves a software agent that learns by interacting with a specific environment, usually through trial and error. Deep learning is a form of machine learning that involves algorithms inspired by the human brain, called neural networks.
A Distributional Perspective on Reinforcement Learning
Bellemare, Marc G., Dabney, Will, Munos, Rémi
In this paper we argue for the fundamental importance of the value distribution: the distribution of the random return received by a reinforcement learning agent. This is in contrast to the common approach to reinforcement learning which models the expectation of this return, or value. Although there is an established body of literature studying the value distribution, thus far it has always been used for a specific purpose such as implementing risk-aware behaviour. We begin with theoretical results in both the policy evaluation and control settings, exposing a significant distributional instability in the latter. We then use the distributional perspective to design a new algorithm which applies Bellman's equation to the learning of approximate value distributions. We evaluate our algorithm using the suite of games from the Arcade Learning Environment. We obtain both state-of-the-art results and anecdotal evidence demonstrating the importance of the value distribution in approximate reinforcement learning. Finally, we combine theoretical and empirical evidence to highlight the ways in which the value distribution impacts learning in the approximate setting.
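To make the idea of applying Bellman's equation to a distribution concrete, here is a minimal sketch rather than the authors' implementation. It assumes the return distribution is stored as probabilities over a fixed, evenly spaced grid of return values (called `atoms` here), a parametrization the abstract does not specify. The function shifts and scales each atom by `reward + gamma * z` and then splits its probability between the two nearest grid points. All names are illustrative.

```python
import math

def project_distribution(probs, atoms, reward, gamma):
    """Project the distribution of reward + gamma * Z back onto the fixed atoms.

    probs[i] is the probability that the return Z equals atoms[i].
    atoms must be evenly spaced and sorted in increasing order.
    """
    n = len(atoms)
    v_min, v_max = atoms[0], atoms[-1]
    delta = (v_max - v_min) / (n - 1)
    projected = [0.0] * n
    for p, z in zip(probs, atoms):
        # Bellman backup of a single atom, clipped to the supported range.
        target = min(max(reward + gamma * z, v_min), v_max)
        # Fractional index of the target on the atom grid.
        b = min(max((target - v_min) / delta, 0.0), float(n - 1))
        lower = int(math.floor(b))
        upper = min(lower + 1, n - 1)
        frac = b - lower
        # Split the probability mass between the two neighbouring atoms.
        projected[lower] += p * (1.0 - frac)
        projected[upper] += p * frac
    return projected

# Example: all mass on return 1.0, then one step with reward 0.5 and gamma 1.0.
print(project_distribution([0.0, 1.0, 0.0], [0.0, 1.0, 2.0], 0.5, 1.0))
```

When no clipping occurs, this projection preserves the mean of the backed-up distribution, so the familiar expected-value backup is recovered by taking the expectation of the result.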
Virtual-to-real Deep Reinforcement Learning: Continuous Control of Mobile Robots for Mapless Navigation
Tai, Lei, Paolo, Giuseppe, Liu, Ming
We present a learning-based mapless motion planner by taking the sparse 10-dimensional range findings and the target position with respect to the mobile robot coordinate frame as input and the continuous steering commands as output. Traditional motion planners for mobile ground robots with a laser range sensor mostly depend on the obstacle map of the navigation environment where both the highly precise laser sensor and the obstacle map building work of the environment are indispensable. We show that, through an asynchronous deep reinforcement learning method, a mapless motion planner can be trained end-to-end without any manually designed features and prior demonstrations. The trained planner can be directly applied in unseen virtual and real environments. The experiments show that the proposed mapless motion planner can navigate the nonholonomic mobile robot to the desired targets without colliding with any obstacles.
Reinforcement Learning with Deep Energy-Based Policies
Haarnoja, Tuomas, Tang, Haoran, Abbeel, Pieter, Levine, Sergey
We propose a method for learning expressive energy-based policies for continuous states and actions, which has been feasible only in tabular domains before. We apply our method to learning maximum entropy policies, resulting in a new algorithm, called soft Q-learning, that expresses the optimal policy via a Boltzmann distribution. We use the recently proposed amortized Stein variational gradient descent to learn a stochastic sampling network that approximates samples from this distribution. The benefits of the proposed algorithm include improved exploration and compositionality that allows transferring skills between tasks, which we confirm in simulated experiments with swimming and walking robots. We also draw a connection to actor-critic methods, which can be viewed as performing approximate inference on the corresponding energy-based model.
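The Boltzmann form of the policy is easiest to see with a handful of discrete actions, where it can be computed exactly. The sketch below covers only that simplified case: the paper targets continuous actions and uses a learned sampling network, which is not shown here. The temperature parameter `alpha` and the function names are assumptions made for illustration.

```python
import math

def soft_value(q_values, alpha=1.0):
    """Soft state value: alpha * log(sum(exp(Q / alpha))), computed stably."""
    m = max(q_values)
    return m + alpha * math.log(sum(math.exp((q - m) / alpha) for q in q_values))

def boltzmann_policy(q_values, alpha=1.0):
    """Action probabilities proportional to exp(Q / alpha)."""
    v = soft_value(q_values, alpha)
    return [math.exp((q - v) / alpha) for q in q_values]

# Higher-valued actions get more probability, but none is ruled out entirely.
print(boltzmann_policy([1.0, 2.0, 3.0], alpha=0.5))
```

A large `alpha` pushes the policy toward uniform (more exploration), while a small `alpha` makes the soft value approach the ordinary maximum over actions.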
Taking Machine Learning to the Next Level
Ethics are an Issue
Don't kid yourself: introducing self-learning robots that can learn faster and better than humans will come with a huge range of issues. On our end, we can only program them to the extent of our human knowledge, which is always going to be limited. If we forget to set system safeties, we could have serious trouble on our hands in terms of public safety. On the other end, the question remains: do we really want to create a world of computers that think, and act, of their own free will, especially when they are smarter than humans? That's definitely an issue we need to reflect on before jumping too far into the reinforcement learning landscape.
Reward-Balancing for Statistical Spoken Dialogue Systems using Multi-objective Reinforcement Learning
Ultes, Stefan, Budzianowski, Paweł, Casanueva, Iñigo, Mrkšić, Nikola, Rojas-Barahona, Lina, Su, Pei-Hao, Wen, Tsung-Hsien, Gašić, Milica, Young, Steve
Reinforcement learning is widely used for dialogue policy optimization where the reward function often consists of more than one component, e.g., the dialogue success and the dialogue length. In this work, we propose a structured method for finding a good balance between these components by searching for the optimal reward component weighting. To render this search feasible, we use multi-objective reinforcement learning to significantly reduce the number of training dialogues required. We apply our proposed method to find optimized component weights for six domains and compare them to a default baseline.
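A small illustration of why the component weighting matters, assuming the two components are combined linearly. The weights and dialogue outcomes below are hypothetical and are not taken from the paper; the sketch shows only the scalarization step, not the multi-objective search itself.

```python
def scalarized_reward(success, num_turns, w_success, w_length):
    """Combine dialogue success and dialogue length into one scalar reward.

    Assumes a linear weighting: success is rewarded, every turn is penalized.
    """
    return w_success * float(success) - w_length * num_turns

# Two hypothetical dialogues: a long successful one and a short failed one.
for w_success in (20.0, 5.0):
    long_success = scalarized_reward(True, 12, w_success, 1.0)
    short_failure = scalarized_reward(False, 4, w_success, 1.0)
    print(w_success, long_success, short_failure)
```

With a success weight of 20 the long successful dialogue scores higher, whereas with a weight of 5 the short failed dialogue does, so a poorly chosen weighting can reward the wrong behaviour.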
Learning model-based planning from scratch
Pascanu, Razvan, Li, Yujia, Vinyals, Oriol, Heess, Nicolas, Buesing, Lars, Racanière, Sebastien, Reichert, David, Weber, Théophane, Wierstra, Daan, Battaglia, Peter
Conventional wisdom holds that model-based planning is a powerful approach to sequential decision-making. It is often very challenging in practice, however, because while a model can be used to evaluate a plan, it does not prescribe how to construct a plan. Here we introduce the "Imagination-based Planner", the first model-based, sequential decision-making agent that can learn to construct, evaluate, and execute plans. Before any action, it can perform a variable number of imagination steps, which involve proposing an imagined action and evaluating it with its model-based imagination. All imagined actions and outcomes are aggregated, iteratively, into a "plan context" which conditions future real and imagined actions. The agent can even decide how to imagine: testing out alternative imagined actions, chaining sequences of actions together, or building a more complex "imagination tree" by navigating flexibly among the previously imagined states using a learned policy. And our agent can learn to plan economically, jointly optimizing for external rewards and computational costs associated with using its imagination. We show that our architecture can learn to solve a challenging continuous control problem, and also learn elaborate planning strategies in a discrete maze-solving task. Our work opens a new direction toward learning the components of a model-based planning system and how to use them.
Model-Agnostic Meta-Learning for Fast Adaptation of Deep Networks
Finn, Chelsea, Abbeel, Pieter, Levine, Sergey
We propose an algorithm for meta-learning that is model-agnostic, in the sense that it is compatible with any model trained with gradient descent and applicable to a variety of different learning problems, including classification, regression, and reinforcement learning. The goal of meta-learning is to train a model on a variety of learning tasks, such that it can solve new learning tasks using only a small number of training samples. In our approach, the parameters of the model are explicitly trained such that a small number of gradient steps with a small amount of training data from a new task will produce good generalization performance on that task. In effect, our method trains the model to be easy to fine-tune. We demonstrate that this approach leads to state-of-the-art performance on two few-shot image classification benchmarks, produces good results on few-shot regression, and accelerates fine-tuning for policy gradient reinforcement learning with neural network policies.
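The phrase "trained such that a small number of gradient steps ... will produce good generalization" can be unpacked with a deliberately tiny example. The sketch below assumes a single scalar parameter `theta` and a toy family of tasks whose loss is `(theta - target) ** 2`; neither comes from the paper. Each meta-update takes one inner gradient step per task, then differentiates the post-adaptation loss with respect to the original `theta` using the chain rule written out by hand.

```python
def task_grad(theta, target):
    """Gradient of the toy task loss (theta - target) ** 2 with respect to theta."""
    return 2.0 * (theta - target)

def maml_step(theta, targets, inner_lr, outer_lr):
    """One meta-update over a batch of toy tasks."""
    meta_grad = 0.0
    for target in targets:
        # Inner loop: adapt to the task with a single gradient step.
        adapted = theta - inner_lr * task_grad(theta, target)
        # Outer gradient: d loss(adapted) / d theta via the chain rule,
        # where d adapted / d theta = 1 - 2 * inner_lr for this quadratic loss.
        meta_grad += task_grad(adapted, target) * (1.0 - 2.0 * inner_lr)
    return theta - outer_lr * meta_grad / len(targets)

theta = 0.0
for _ in range(200):
    theta = maml_step(theta, targets=[1.0, 3.0], inner_lr=0.1, outer_lr=0.1)
print(theta)
```

In this toy setting the meta-learned initialization settles midway between the task targets, the point from which one gradient step helps both tasks equally. Real uses sample separate data for the inner and outer losses and rely on automatic differentiation rather than a hand-derived chain rule.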
Freeway Merging in Congested Traffic based on Multipolicy Decision Making with Passive Actor Critic
Nishi, Tomoki, Doshi, Prashant, Prokhorov, Danil
Freeway merging in congested traffic is a significant challenge toward fully automated driving. Merging vehicles need to decide not only how to merge into a spot, but also where to merge. We present a method for freeway merging based on multi-policy decision making with a reinforcement learning method called passive actor-critic (pAC), which learns with less knowledge of the system and without active exploration. The method selects a merging spot candidate by using the state value learned with pAC. We evaluate our method using real traffic data. Our experiments show that pAC achieves a 92% success rate in merging onto a freeway, which is comparable to human decision making.
Distral: Robust Multitask Reinforcement Learning
Teh, Yee Whye, Bapst, Victor, Czarnecki, Wojciech Marian, Quan, John, Kirkpatrick, James, Hadsell, Raia, Heess, Nicolas, Pascanu, Razvan
Most deep reinforcement learning algorithms are data inefficient in complex and rich environments, limiting their applicability to many scenarios. One direction for improving data efficiency is multitask learning with shared neural network parameters, where efficiency may be improved through transfer across related tasks. In practice, however, this is not usually observed, because gradients from different tasks can interfere negatively, making learning unstable and sometimes even less data efficient. Another issue is the different reward schemes between tasks, which can easily lead to one task dominating the learning of a shared model. We propose a new approach for joint training of multiple tasks, which we refer to as Distral (Distill & transfer learning). Instead of sharing parameters between the different workers, we propose to share a "distilled" policy that captures common behaviour across tasks. Each worker is trained to solve its own task while constrained to stay close to the shared policy, while the shared policy is trained by distillation to be the centroid of all task policies. Both aspects of the learning process are derived by optimizing a joint objective function. We show that our approach supports efficient transfer on complex 3D environments, outperforming several related methods. Moreover, the proposed learning process is more robust and more stable, attributes that are critical in deep reinforcement learning.
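One way to picture the "centroid of all task policies" is to look at a single state with a few discrete actions. The sketch below is an illustration under that assumption, not the paper's joint objective: it measures closeness with the KL divergence from each task policy to the shared policy, in which case the shared policy that minimizes the total divergence is the plain average of the task policies. The helper names are hypothetical.

```python
import math

def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions; q must be positive where p is."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)

def centroid(policies):
    """Average the task policies action by action."""
    n = len(policies)
    return [sum(action_probs) / n for action_probs in zip(*policies)]

def total_kl(policies, shared):
    """Sum of KL(task policy || shared policy) over all tasks."""
    return sum(kl_divergence(p, shared) for p in policies)

task_policies = [[0.8, 0.2], [0.4, 0.6]]
shared = centroid(task_policies)
print(shared, total_kl(task_policies, shared))
```

In this simplified picture, the same KL term would also act as the penalty that keeps each task policy from drifting far from the shared one.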