 Reinforcement Learning


Grandmaster level in StarCraft II using multi-agent reinforcement learning

#artificialintelligence

Many real-world applications require artificial agents to compete and coordinate with other agents in complex environments. As a stepping stone to this goal, the domain of StarCraft has emerged as an important challenge for artificial intelligence research, owing to its iconic and enduring status among the most difficult professional esports and its relevance to the real world in terms of its raw complexity and multi-agent challenges. Over the course of a decade and numerous competitions [1, 2, 3], the strongest agents have simplified important aspects of the game, utilized superhuman capabilities, or employed hand-crafted sub-systems [4]. Despite these advantages, no previous agent has come close to matching the overall skill of top StarCraft players. We chose to address the challenge of StarCraft using general-purpose learning methods that are in principle applicable to other complex domains: a multi-agent reinforcement learning algorithm that uses data from both human and agent games within a diverse league of continually adapting strategies and counter-strategies, each represented by deep neural networks [5, 6].


Texas A&M and Simon Fraser Universities Open-Source RL Toolkit for Card Games

#artificialintelligence

In July the poker-playing bot Pluribus beat top professionals in a six-player no-limit Texas Hold'Em poker game. Pluribus taught itself from scratch using a form of reinforcement learning (RL) to become the first AI program to defeat elite humans in a poker game with more than two players. Compared to perfect information games such as Chess or Go, poker presents a number of unique challenges with its concealed cards, bluffing and other human strategies. Now a team of researchers from Texas A&M University and Canada's Simon Fraser University have open-sourced a toolkit called "RLCard" for applying RL research to card games. While RL has already produced a number of breakthroughs in goal-oriented tasks and has high potential, it's not without its drawbacks.


Fundamentals of Reinforcement Learning: Understanding Blackjack Strategy through Monte Carlo…

#artificialintelligence

Welcome to GradientCrescent's special series on reinforcement learning. This series will serve to introduce some of the fundamental concepts in reinforcement learning using digestible examples, primarily obtained from the "Reinforcement Learning" text by Sutton et al. Note that code in this series will be kept to a minimum; readers interested in implementations are directed to the official course, or our Github. The secondary purpose of this series is to reinforce (pun intended) my own learning in the field. Reinforcement Learning has taken the AI world by storm.
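As a rough illustration of the Monte Carlo idea named in the title, the sketch below estimates state values by averaging the returns observed after the first visit to each state. It is a minimal sketch under stated assumptions, not the series' own code: the function name `first_visit_mc`, the episode format, and the three hand-written toy episodes (states labelled by the player's card sum) are all hypothetical.

```python
from collections import defaultdict


def first_visit_mc(episodes, gamma=1.0):
    """Estimate state values by averaging first-visit returns.

    Each episode is a list of (state, reward) pairs, where reward is the
    reward received after leaving that state. (Assumed format.)
    """
    returns = defaultdict(list)
    for episode in episodes:
        # Backward pass: compute the discounted return following each step.
        g = 0.0
        step_returns = []
        for state, reward in reversed(episode):
            g = reward + gamma * g
            step_returns.append((state, g))
        step_returns.reverse()

        # Record the return only for the first visit to each state.
        seen = set()
        for state, g in step_returns:
            if state not in seen:
                seen.add(state)
                returns[state].append(g)

    # The value estimate is the sample mean of the recorded returns.
    return {s: sum(gs) / len(gs) for s, gs in returns.items()}


# Toy, hand-written episodes (not real blackjack data): +1 for a win, -1 for a loss.
episodes = [
    [("13", 0.0), ("18", 1.0)],   # hit on 13, stand on 18, win
    [("13", 0.0), ("20", 1.0)],   # hit on 13, stand on 20, win
    [("13", -1.0)],               # hit on 13, bust
]
values = first_visit_mc(episodes)
```

With more simulated hands, the averages for each state would settle toward the expected return of the policy that generated the episodes.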


Qrash Course II: From Q-Learning to Gradient Policy & Actor-Critic in 12 Minutes

#artificialintelligence

Let's continue our journey and introduce two more algorithms: Gradient Policy and Actor-Critic. These two, along with DQN, are probably the most fundamental building-blocks of modern Deep Reinforcement Learning. The first question we should probably ask ourselves is why should we advance from Q-Learning? Where does it fail or underperform? Well, this algorithm does have a few pitfalls, and it's important to understand them: How should we handle these situations?
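For reference, here is a minimal sketch of the tabular Q-Learning update that the question above starts from. The function name, the dictionary-based table, and the state and action labels are assumptions made for illustration, not code from the article.

```python
from collections import defaultdict


def q_learning_update(q, state, action, reward, next_state, actions,
                      alpha=0.1, gamma=0.99, done=False):
    """Apply one tabular Q-Learning update in place and return the new value."""
    # Bootstrap from the best action in the next state (zero if terminal).
    best_next = 0.0 if done else max(q[(next_state, a)] for a in actions)
    td_target = reward + gamma * best_next
    # Move the current estimate a fraction alpha toward the target.
    q[(state, action)] += alpha * (td_target - q[(state, action)])
    return q[(state, action)]


q = defaultdict(float)  # every unseen (state, action) pair starts at 0.0
first = q_learning_update(q, "s0", "right", 1.0, "s1", ["left", "right"],
                          alpha=0.5, gamma=0.9)
second = q_learning_update(q, "s0", "right", 1.0, "s1", ["left", "right"],
                           alpha=0.5, gamma=0.9)
```

The `max` over a finite list of actions is the step that does not carry over directly to continuous action spaces, which is one common motivation for policy-gradient methods.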


Racing Cars in the Brisbane Office

#artificialintelligence

As a part of Expedia Group's partnership with AWS we recently took an amazing opportunity to host a DeepRacer competition in our Brisbane office. DeepRacer is designed to introduce people of all backgrounds to Machine Learning. The goal of the competition is to engineer a control loop for an autonomous toy racing car that enables the car to complete a full circuit of a physical race track in the shortest amount of time. This control loop is constructed using a Machine Learning technique called Reinforcement Learning. Reinforcement Learning encourages an autonomous machine to perform certain actions.


Disentangled Cumulants Help Successor Representations Transfer to New Tasks

arXiv.org Machine Learning

Biological intelligence can learn to solve many diverse tasks in a data-efficient manner by re-using basic knowledge and skills from one task to another. Furthermore, many such skills are acquired without explicit supervision in an intrinsically driven fashion. This is in contrast to the state-of-the-art reinforcement learning agents, which typically start learning each new task from scratch and struggle with knowledge transfer. In this paper we propose a principled way to learn a basis set of policies, which, when recombined through generalised policy improvement, come with guarantees on the coverage of the final task space. In particular, we concentrate on solving goal-based downstream tasks where the execution order of actions is not important. We demonstrate both theoretically and empirically that learning a small number of policies that reach intrinsically specified goal regions in a disentangled latent space can be re-used to quickly achieve a high level of performance on an exponentially larger number of externally specified, often significantly more complex downstream tasks. Our learning pipeline consists of two stages. First, the agent learns to perform intrinsically generated, goal-based tasks in the total absence of environmental rewards. Second, the agent leverages this experience to quickly achieve a high level of performance on numerous diverse externally specified tasks.
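The recombination step mentioned in the abstract, generalised policy improvement, can be sketched in a few lines: for each action, take the best value that any base policy promises on the current task, then act greedily on that envelope. This is only a sketch of the standard rule, not the paper's full pipeline; the function name and the hand-picked Q-values are hypothetical.

```python
def gpi_action(q_values_per_policy):
    """Return the action index chosen by generalised policy improvement.

    q_values_per_policy[i][a] is the estimated value, on the current task,
    of taking action a in the current state and then following base policy i.
    """
    n_actions = len(q_values_per_policy[0])
    # For each action, keep the best value promised by any base policy.
    best_per_action = [
        max(q[a] for q in q_values_per_policy) for a in range(n_actions)
    ]
    # Act greedily with respect to that upper envelope.
    return max(range(n_actions), key=best_per_action.__getitem__)


# Two base policies, three actions (illustrative numbers only).
q_values = [
    [0.2, 0.5, 0.1],  # values under base policy 0
    [0.4, 0.3, 0.9],  # values under base policy 1
]
chosen = gpi_action(q_values)
```

Here the envelope is `[0.4, 0.5, 0.9]`, so the last action is selected even though base policy 0 alone would have preferred the middle one.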


Biologically inspired architectures for sample-efficient deep reinforcement learning

arXiv.org Machine Learning

Deep reinforcement learning requires a heavy price in terms of sample efficiency and overparameterization in the neural networks used for function approximation. In this work, we use tensor factorization in order to learn more compact representations for reinforcement learning policies. We show empirically that in the low-data regime, it is possible to learn online policies with 2 to 10 times fewer total coefficients, with little to no loss of performance. We also leverage progress in second-order optimization, and use the theory of wavelet scattering to further reduce the number of learned coefficients, by foregoing learning the topmost convolutional layer filters altogether. We evaluate our results on the Atari suite against recent baseline algorithms that represent the state-of-the-art in data efficiency, and get comparable results with an order of magnitude gain in weight parsimony.


Deep Reinforcement Learning for Multi-Driver Vehicle Dispatching and Repositioning Problem

arXiv.org Artificial Intelligence

Order dispatching and driver repositioning (also known as fleet management) in the face of spatially and temporally varying supply and demand are central to a ride-sharing platform marketplace. Handcrafting heuristic solutions that account for the dynamics in these resource allocation problems is difficult, and the problems may be better handled by an end-to-end machine learning method. Previous works have applied machine learning methods to the problem from a high-level perspective, where the learning method is responsible for either repositioning the drivers or dispatching orders, and as a further simplification, the drivers are considered independent agents maximizing their own reward functions. In this paper we present a deep reinforcement learning approach for tackling the full fleet management and dispatching problems. In addition to treating the drivers as individual agents, we consider the problem from a system-centric perspective, where a central fleet management agent is responsible for decision-making for all drivers.

INTRODUCTION

The order dispatching and fleet management system at a ride-sharing company must make decisions both for assigning available drivers to nearby passengers (hereafter called orders) and for repositioning drivers who have no nearby orders. These decisions have short-term effects on the revenue generated by the drivers and driver availability. In the long term they change the distribution of drivers across the city, which in turn has a critical impact on how well future orders can be served. Provident algorithmic solutions, which account for the short-term and long-term consequences of their decisions, can improve the quality of service of the ride-sharing platforms and are thus an important area of research.
Recent works [1], [2] have successfully applied Deep Reinforcement Learning (RL) techniques to dispatching problems, such as the Traveling Salesman Problem (TSP) and the more general Vehicle Routing Problem (VRP) [3]; however, they have primarily focused on static (i.e., those where all orders are known up front) and/or single-driver dispatching problems. In contrast to these problems, the fleet management and order dispatching problem that ride-sharing platforms face has multiple drivers and dynamically changing supply and demand conditions. We refer to this dynamic dispatching and fleet management problem as the Multi-Driver Vehicle Dispatching and Repositioning Problem (MDVDRP). VRPs and other problems similar to the MDVDRP are studied in the field of combinatorial optimization. Exactly solving instances of these problems at the scale of real-world environments is computationally intractable [4].


Theory-based Causal Transfer: Integrating Instance-level Induction and Abstract-level Structure Learning

arXiv.org Artificial Intelligence

Learning transferable knowledge across similar but different settings is a fundamental component of generalized intelligence. In this paper, we approach the transfer learning challenge from a causal theory perspective. Our agent is endowed with two basic yet general theories for transfer learning: (i) a task shares a common abstract structure that is invariant across domains, and (ii) the behavior of specific features of the environment remains constant across domains. We adopt a Bayesian perspective of causal theory induction and use these theories to transfer knowledge between environments. Given these general theories, the goal is to train an agent by interactively exploring the problem space to (i) discover, form, and transfer useful abstract and structural knowledge, and (ii) induce useful knowledge from the instance-level attributes observed in the environment. A hierarchy of Bayesian structures is used to model abstract-level structural causal knowledge, and an instance-level associative learning scheme learns which specific objects can be used to induce state changes through interaction. This model-learning scheme is then integrated with a model-based planner to achieve a task in the OpenLock environment, a virtual "escape room" with a complex hierarchy that requires agents to reason about an abstract, generalized causal structure. We compare performances against a set of predominant model-free reinforcement learning (RL) algorithms. RL agents showed poor ability to transfer learned knowledge across different trials, whereas the proposed model revealed similar performance trends as human learners and, more importantly, demonstrated transfer behavior across trials and learning situations.


End-to-End Model-Free Reinforcement Learning for Urban Driving using Implicit Affordances

arXiv.org Artificial Intelligence

Solving this task is still an open problem and it seems complicated to handle such difficult and highly variable situations with a classic rule-based approach. This is why a significant part of the state of the art in autonomous driving [20, 4, 5] focuses on end-to-end systems, i.e. learning a driving policy from data without relying on handcrafted rules. Imitation learning (IL) [28] aims to reproduce the behavior of an expert (a human driver for autonomous driving) by learning to mimic the control the human driver applied in the same situation. This leverages the massive amount of data annotated with human driving that most automotive manufacturers and suppliers can obtain relatively easily. On the other hand, as the human driver is always in an almost perfect situation, IL algorithms suffer from a distribution mismatch, i.e. the algorithm will never encounter failing cases and thus will not react appropriately in those conditions.