Reinforcement Learning
Dota 2 with Large Scale Deep Reinforcement Learning
OpenAI: Berner, Christopher, Brockman, Greg, Chan, Brooke, Cheung, Vicki, Dębiak, Przemysław, Dennison, Christy, Farhi, David, Fischer, Quirin, Hashme, Shariq, Hesse, Chris, Józefowicz, Rafal, Gray, Scott, Olsson, Catherine, Pachocki, Jakub, Petrov, Michael, Pinto, Henrique Pondé de Oliveira, Raiman, Jonathan, Salimans, Tim, Schlatter, Jeremy, Schneider, Jonas, Sidor, Szymon, Sutskever, Ilya, Tang, Jie, Wolski, Filip, Zhang, Susan
On April 13th, 2019, OpenAI Five became the first AI system to defeat the world champions at an esports game. The game of Dota 2 presents novel challenges for AI systems such as long time horizons, imperfect information, and complex, continuous state-action spaces, all challenges which will become increasingly central to more capable AI systems. OpenAI Five leveraged existing reinforcement learning techniques, scaled to learn from batches of approximately 2 million frames every 2 seconds. We developed a distributed training system and tools for continual training which allowed us to train OpenAI Five for 10 months. By defeating the Dota 2 world champion (Team OG), OpenAI Five demonstrates that self-play reinforcement learning can achieve superhuman performance on a difficult task.
High dimensional precision medicine from patient-derived xenografts
Rashid, Naim U., Luckett, Daniel J., Chen, Jingxiang, Lawson, Michael T., Wang, Longshaokan, Zhang, Yunshu, Laber, Eric B., Liu, Yufeng, Yeh, Jen Jen, Zeng, Donglin, Kosorok, Michael R.
The complexity of human cancer often results in significant heterogeneity in response to treatment. Precision medicine offers potential to improve patient outcomes by leveraging this heterogeneity. Individualized treatment rules (ITRs) formalize precision medicine as maps from the patient covariate space into the space of allowable treatments. The optimal ITR is that which maximizes the mean of a clinical outcome in a population of interest. Patient-derived xenograft (PDX) studies permit the evaluation of multiple treatments within a single tumor and thus are ideally suited for estimating optimal ITRs. PDX data are characterized by correlated outcomes, a high-dimensional feature space, and a large number of treatments. Existing methods for estimating optimal ITRs do not take advantage of the unique structure of PDX data or handle the associated challenges well. In this paper, we explore machine learning methods for estimating optimal ITRs from PDX data. We analyze data from a large PDX study to identify biomarkers that are informative for developing personalized treatment recommendations in multiple cancers. We estimate optimal ITRs using regression-based approaches such as Q-learning and direct search methods such as outcome weighted learning. Finally, we implement a superlearner approach to combine a set of estimated ITRs and show that the resulting ITR performs better than any of the input ITRs, mitigating uncertainty regarding user choice of any particular ITR estimation methodology. Our results indicate that PDX data are a valuable resource for developing individualized treatment strategies in oncology.
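The regression-based route to an ITR mentioned above can be illustrated with a minimal single-stage sketch: fit one outcome model per treatment, then recommend the treatment with the largest predicted outcome. Everything here is an assumption made for illustration (the function names `fit_outcome_models` and `recommend`, the ridge-regularised linear models, and the synthetic data); it ignores the correlated outcomes and high-dimensional features of real PDX data and is not the authors' pipeline.

```python
import numpy as np

def fit_outcome_models(X, A, Y, n_treatments, ridge=1e-3):
    """Fit one ridge-regularised linear outcome model per treatment arm."""
    Xb = np.hstack([np.ones((X.shape[0], 1)), X])  # prepend an intercept column
    d = Xb.shape[1]
    coefs = np.zeros((n_treatments, d))
    for a in range(n_treatments):
        rows = A == a  # observations that received treatment a
        Xa, Ya = Xb[rows], Y[rows]
        coefs[a] = np.linalg.solve(Xa.T @ Xa + ridge * np.eye(d), Xa.T @ Ya)
    return coefs

def recommend(coefs, X):
    """Estimated ITR: map covariates to the treatment with the best predicted outcome."""
    Xb = np.hstack([np.ones((X.shape[0], 1)), X])
    return np.argmax(Xb @ coefs.T, axis=1)

# Synthetic data (hypothetical, not PDX data): treatment 1 helps when the
# first covariate is positive, treatment 0 helps when it is negative.
rng = np.random.default_rng(0)
n = 600
X = rng.normal(size=(n, 2))
A = rng.integers(0, 2, size=n)
Y = np.where(A == 1, X[:, 0], -X[:, 0]) + 0.1 * rng.normal(size=n)

coefs = fit_outcome_models(X, A, Y, n_treatments=2)
rule = recommend(coefs, np.array([[2.0, 0.0], [-2.0, 0.0]]))
```

On this toy data the estimated rule assigns treatment 1 to the first query point and treatment 0 to the second, matching how the outcomes were generated.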
Provably Efficient Reinforcement Learning with Aggregated States
Dong, Shi, Van Roy, Benjamin, Zhou, Zhengyuan
We establish that an optimistic variant of Q-learning applied to a finite-horizon episodic Markov decision process with an aggregated state representation incurs regret $\tilde{\mathcal{O}}(\sqrt{H^5 M K} + \epsilon HK)$, where $H$ is the horizon, $M$ is the number of aggregate states, $K$ is the number of episodes, and $\epsilon$ is the largest difference between any pair of optimal state-action values associated with a common aggregate state. Notably, this regret bound does not depend on the number of states or actions. To the best of our knowledge, this is the first such result pertaining to a reinforcement learning algorithm applied with nontrivial value function approximation without any restrictions on the Markov decision process.
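To see how the two terms of the bound trade off, the sketch below evaluates $\sqrt{H^5 M K}$ and $\epsilon H K$ separately, ignoring the constants and logarithmic factors hidden by $\tilde{\mathcal{O}}$. The function name `regret_bound_terms` and the example values are hypothetical and chosen only for illustration.

```python
import math

def regret_bound_terms(H, M, K, eps):
    """Return (sqrt(H^5 * M * K), eps * H * K), dropping constants and log factors."""
    learning_term = math.sqrt(H**5 * M * K)
    misspecification_term = eps * H * K
    return learning_term, misspecification_term

# Per-episode view: the first term shrinks as K grows, the second stays at eps * H.
for K in (10**4, 10**6):
    learn, bias = regret_bound_terms(H=10, M=50, K=K, eps=0.01)
    print(K, learn / K, bias / K)
```

Dividing by $K$ shows that the first term vanishes per episode as $K$ grows, while the second contributes a constant $\epsilon H$ per episode that is attributable to the aggregation error.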
Breakthrough Research In Reinforcement Learning From 2019
Reinforcement learning (RL) continues to be less valuable for business applications than supervised learning, and even unsupervised learning. It is successfully applied only in areas where huge amounts of simulated data can be generated, like robotics and games. However, many experts recognize RL as a promising path towards Artificial General Intelligence (AGI), or true intelligence. Thus, research teams from top institutions and tech leaders are seeking ways to make RL algorithms more sample-efficient and stable. We've selected and summarized 10 research papers that we think are representative of the latest research trends in reinforcement learning. The papers explore, among others, the interaction of multiple agents, off-policy learning, and more efficient exploration.
Introduction to Reinforcement Learning
Edward Thorndike observed his cats as they tried to escape from home-made puzzle boxes. The puzzles were simple: all the cats had to do was pull a string or push a pole and they were out. When first confronted with a puzzle, the cats took a long time to solve it. However, when faced with the same or a similar problem again, they solved it and escaped much faster. From this he concluded that responses that produce a satisfying effect in a particular situation become more likely to occur again in that situation, and responses that produce a discomforting effect become less likely to occur again in that situation.
Training multi-agent AI systems to solve complex tasks through cooperation
We present a novel approach to cooperative multi-agent reinforcement learning (RL) that assigns tasks to individual agents within a group, thereby improving the entire group's ability to collaborate. We tested this method in the real-time strategy game StarCraft: Brood War, and found that our RL-trained model significantly outperformed computer-controlled players that relied on carefully tuned rule-based baselines. Perhaps most important, these gains carried over to matches with significantly larger armies than those included in our training scenarios. We're releasing the source code for this approach on our TorchCraftAI GitHub repository, and detailing our results, which indicate that treating collaborative multi-agent RL as a dynamic assignment problem can lead to groups of agents that are better at generalizing to more complex situations. Our approach focuses on multi-agent collaborative (MAC) problems where agents have to carry out multiple intermediate tasks in order to accomplish a larger one.
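One way to picture the assignment step is as a search for the agent-to-task matching with the highest total score. The sketch below does this by brute force over a small score matrix. The function name `best_assignment` and the hand-written scores are hypothetical; in an RL setting the scores would presumably come from a learned model, and this is not the code released in the TorchCraftAI repository.

```python
from itertools import permutations

def best_assignment(scores):
    """Pick one distinct task per agent so that the summed score is maximal.

    scores[i][j] is the (hypothetical) value of agent i taking task j.
    Assumes there are at least as many tasks as agents.
    """
    n_agents, n_tasks = len(scores), len(scores[0])
    best, best_value = None, float("-inf")
    for tasks in permutations(range(n_tasks), n_agents):
        value = sum(scores[i][t] for i, t in enumerate(tasks))
        if value > best_value:
            best, best_value = tasks, value
    return list(best), best_value

# Agent 0 prefers task 0, but the group does better if it takes task 1.
assignment, value = best_assignment([[5, 4], [4, 1]])
```

Brute force is only practical for a handful of agents. The example is chosen so that the best joint assignment differs from each agent greedily taking its own favourite task.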
Control-Tutored Reinforcement Learning
De Lellis, Francesco, Auletta, Fabrizia, Russo, Giovanni, De Lellis, Piero, di Bernardo, Mario
We introduce a control-tutored reinforcement learning (CTRL) algorithm. The idea is to enhance tabular learning algorithms so as to improve the exploration of the state-space, and substantially reduce learning times by leveraging some limited knowledge of the plant encoded into a tutoring model-based control strategy. We illustrate the benefits of our novel approach and its effectiveness by using the problem of controlling one or more agents to herd and contain within a goal region a set of target free-roving agents in the plane.
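The abstract does not spell out how the tutor and the learner are combined, so the following is only one possible reading: a tabular Q-learning agent whose exploratory actions come from a model-based controller instead of a uniform random draw. The names `select_action` and `q_update`, and the probability-based switching rule, are assumptions made for illustration and not the published CTRL algorithm.

```python
import random

def select_action(q_row, tutor_action, explore_prob, rng=random):
    """Act greedily on the Q-table row, but explore by following the tutor.

    tutor_action is the action suggested by a (hypothetical) model-based
    controller for the current state.
    """
    if rng.random() < explore_prob:
        return tutor_action
    return max(range(len(q_row)), key=lambda a: q_row[a])

def q_update(q, s, a, r, s_next, alpha=0.1, gamma=0.95):
    """Standard tabular Q-learning update; the tutor only affects exploration."""
    target = r + gamma * max(q[s_next])
    q[s][a] += alpha * (target - q[s][a])

# Tiny usage example with two states and two actions.
q = [[0.0, 0.0], [1.0, 2.0]]
q_update(q, s=0, a=1, r=1.0, s_next=1, alpha=0.5, gamma=0.5)
```

Under this reading, the tutor steers the agent toward informative regions of the state space early on, while the learned Q-table takes over as `explore_prob` is reduced.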
Provably Efficient Exploration in Policy Optimization
Cai, Qi, Yang, Zhuoran, Jin, Chi, Wang, Zhaoran
While policy-based reinforcement learning (RL) achieves tremendous successes in practice, it is significantly less understood in theory, especially compared with value-based RL. In particular, it remains elusive how to design a provably efficient policy optimization algorithm that incorporates exploration. To bridge such a gap, this paper proposes an Optimistic variant of the Proximal Policy Optimization algorithm (OPPO), which follows an "optimistic version" of the policy gradient direction. This paper proves that, in the problem of episodic Markov decision process with linear function approximation, unknown transition, and adversarial reward with full-information feedback, OPPO achieves $\tilde{O}(\sqrt{d^3 H^3 T})$ regret. Here $d$ is the feature dimension, $H$ is the episode horizon, and $T$ is the total number of steps. To the best of our knowledge, OPPO is the first provably efficient policy optimization algorithm that explores.
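A rough sketch of what an "optimistic" policy improvement step can look like in the tabular case: the new policy reweights the old one exponentially by an action-value estimate inflated with an exploration bonus. The function name `optimistic_policy_update`, the way the bonus is added, and the tabular arrays are assumptions for illustration; the paper works with linear function approximation and its exact construction may differ.

```python
import numpy as np

def optimistic_policy_update(policy, q_values, bonus, step_size):
    """pi_new(a|s) is proportional to pi_old(a|s) * exp(step_size * (Q(s,a) + bonus(s,a))).

    policy, q_values and bonus are arrays of shape (n_states, n_actions);
    policy must have strictly positive entries.
    """
    logits = np.log(policy) + step_size * (q_values + bonus)
    logits -= logits.max(axis=1, keepdims=True)  # subtract row max for numerical stability
    new_policy = np.exp(logits)
    return new_policy / new_policy.sum(axis=1, keepdims=True)

# One state, two actions: equal value estimates, but action 0 is less explored
# and therefore carries a larger (hypothetical) bonus.
policy = np.array([[0.5, 0.5]])
q_values = np.array([[0.0, 0.0]])
bonus = np.array([[np.log(3.0), 0.0]])
updated = optimistic_policy_update(policy, q_values, bonus, step_size=1.0)
```

In this example the update shifts probability toward the under-explored action even though both actions have the same estimated value.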
Recruitment-imitation Mechanism for Evolutionary Reinforcement Learning
Lü, Shuai, Han, Shuai, Zhou, Wenbo, Zhang, Junwei
Reinforcement learning, evolutionary algorithms and imitation learning are three principal methods for continuous control tasks. Reinforcement learning is sample efficient, yet it is sensitive to hyper-parameter settings and needs efficient exploration; evolutionary algorithms are stable, but have low sample efficiency; imitation learning is both sample efficient and stable, but it requires the guidance of expert data. In this paper, we propose the Recruitment-imitation Mechanism (RIM) for evolutionary reinforcement learning, a scalable framework that combines the advantages of the three methods mentioned above. The core of this framework is a dual-actor, single-critic reinforcement learning agent. This agent can recruit high-fitness actors from the population of evolutionary algorithms, which guide its learning from the experience replay buffer. At the same time, low-fitness actors in the evolutionary population can imitate the behavior patterns of the reinforcement learning agent and improve their adaptability. The reinforcement and imitation learners in this framework can be replaced with any off-policy actor-critic reinforcement learner or data-driven imitation learner. We evaluate RIM on a series of benchmarks for continuous control tasks in MuJoCo. The experimental results show that RIM outperforms prior evolutionary or reinforcement learning methods. RIM's components perform significantly better than the components of the previous evolutionary reinforcement learning algorithm, and recruitment using a soft update enables the reinforcement learning agent to learn faster than recruitment using a hard update.
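The difference between a soft and a hard update can be shown in a few lines: a soft update blends the source parameters into the target with a small weight, while a hard update copies them outright. The function name `soft_update`, the list-of-arrays parameter layout, and the value of `tau` are assumptions for illustration and are not taken from the RIM implementation.

```python
import numpy as np

def soft_update(target_params, source_params, tau):
    """Blend source parameters into the target: new = (1 - tau) * target + tau * source.

    tau = 1.0 reproduces a hard update (a plain copy of the source).
    """
    return [(1.0 - tau) * t + tau * s for t, s in zip(target_params, source_params)]

agent_actor = [np.array([0.0, 0.0])]      # parameters of the RL agent's actor (toy values)
recruited_actor = [np.array([1.0, 2.0])]  # parameters of a high-fitness actor (toy values)

soft = soft_update(agent_actor, recruited_actor, tau=0.1)
hard = soft_update(agent_actor, recruited_actor, tau=1.0)
```

With a small `tau`, the agent's actor moves only part of the way toward the recruited actor on each recruitment, so what it has already learned is not overwritten in a single step.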
Learning To Reach Goals Without Reinforcement Learning
Ghosh, Dibya, Gupta, Abhishek, Fu, Justin, Reddy, Ashwin, Devin, Coline, Eysenbach, Benjamin, Levine, Sergey
Affiliations: University of California Berkeley (Ghosh, Gupta, Fu, Reddy, Devin, Levine); Carnegie Mellon University (Eysenbach)
Abstract
Imitation learning algorithms provide a simple and straightforward approach for training control policies via supervised learning. By maximizing the likelihood of good actions provided by an expert demonstrator, supervised imitation learning can produce effective policies without the algorithmic complexities and optimization challenges of reinforcement learning, at the cost of requiring an expert demonstrator to provide the demonstrations. In this paper, we ask: can we take insights from imitation learning to design algorithms that can effectively acquire optimal policies from scratch without any expert demonstrations? The key observation that makes this possible is that, in the multi-task setting, trajectories that are generated by a suboptimal policy can still serve as optimal examples for other tasks. In particular, when tasks correspond to different goals, every trajectory is a successful demonstration for the goal state that it actually reaches. We propose a simple algorithm for learning goal-reaching behaviors without any demonstrations, complicated user-provided reward functions, or complex reinforcement learning methods. Our method simply maximizes the likelihood of actions the agent actually took in its own previous rollouts, conditioned on the goal being the state that it actually reached. Although related variants of this approach have been proposed previously in imitation learning with demonstrations, we show how this approach can effectively learn goal-reaching policies from scratch. We present a theoretical result linking self-supervised imitation learning and reinforcement learning, and empirical results showing that it performs competitively with more complex reinforcement learning methods on a range of challenging goal reaching problems, while yielding advantages in terms of stability and use of offline data.
1 Introduction
Reinforcement learning (RL) algorithms hold the promise of providing a broadly-applicable tool for automating control, and the combination of high-capacity deep neural network models with RL extends their applicability to settings with complex observations and that require intricate policies. However, RL with function approximation, including deep RL, presents a challenging optimization problem. Despite years of research, current deep RL methods are far from a turnkey solution: most popular methods lack convergence guarantees (Baird, 1995; Tsitsiklis & Van Roy, 1997) or require prohibitive numbers of samples (Schulman et al., 2015; Lillicrap et al., 2015).
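The relabeling idea from the abstract, that every trajectory is a successful demonstration for the state it actually reaches, can be sketched as a function that turns one rollout into supervised training examples. The function name `relabeled_examples` and the choice to also record the number of steps remaining are assumptions for illustration; the resulting tuples would then be fed to an ordinary maximum-likelihood policy fit, which is omitted here.

```python
def relabeled_examples(states, actions):
    """Yield (state, action, goal, steps_remaining) tuples from one rollout.

    states has one more entry than actions: states[t + 1] is the state
    reached after taking actions[t] in states[t]. Each later state in the
    rollout is treated as a goal that the earlier action helped reach.
    """
    for t, action in enumerate(actions):
        for h in range(t + 1, len(states)):
            yield states[t], action, states[h], h - t

# A toy rollout on a line: the agent moved right twice, from 0 to 2.
examples = list(relabeled_examples(states=[0, 1, 2], actions=["right", "right"]))
```

Even if the agent was originally trying to reach some other goal, this rollout yields valid examples of how to reach states 1 and 2, which is what lets the method learn without expert demonstrations or a hand-designed reward.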