AITopics | Reinforcement Learning

Collaborating Authors

Reinforcement Learning

"Reinforcement learning is learning what to do – how to map situations to actions – so as to maximize a numerical reward signal. The learner is not told which actions to take, as in most forms of machine learning, but instead must discover which actions yield the most reward by trying them."
– Sutton, Richard S. and Andrew G. Barto. Reinforcement Learning: An Introduction. (1.1). MIT Press, Cambridge, MA, 1998.

News Overviews Instructional Materials AI-Alerts Classics

Dota 2 with Large Scale Deep Reinforcement Learning

OpenAI, null, :, null, Berner, Christopher, Brockman, Greg, Chan, Brooke, Cheung, Vicki, Dębiak, Przemysław, Dennison, Christy, Farhi, David, Fischer, Quirin, Hashme, Shariq, Hesse, Chris, Józefowicz, Rafal, Gray, Scott, Olsson, Catherine, Pachocki, Jakub, Petrov, Michael, Pinto, Henrique Pondé de Oliveira, Raiman, Jonathan, Salimans, Tim, Schlatter, Jeremy, Schneider, Jonas, Sidor, Szymon, Sutskever, Ilya, Tang, Jie, Wolski, Filip, Zhang, Susan

arXiv.org Machine LearningDec-13-2019

On April 13th, 2019, OpenAI Five became the first AI system to defeat the world champions at an esports game. The game of Dota 2 presents novel challenges for AI systems such as long time horizons, imperfect information, and complex, continuous state-action spaces, all challenges which will become increasingly central to more capable AI systems. OpenAI Five leveraged existing reinforcement learning techniques, scaled to learn from batches of approximately 2 million frames every 2 seconds. We developed a distributed training system and tools for continual training which allowed us to train OpenAI Five for 10 months. By defeating the Dota 2 world champion (Team OG), OpenAI Five demonstrates that self-play reinforcement learning can achieve superhuman performance on a difficult task.

agent, dota 2, experiment, (17 more...)

arXiv.org Machine Learning

1912.0668

Country:

North America > United States > California > San Francisco County > San Francisco (0.14)
North America > United States > New York > New York County > New York City (0.04)
Asia > Middle East > Jordan (0.04)
(6 more...)

Genre: Research Report > New Finding (0.45)

Industry:

Leisure & Entertainment > Sports (1.00)
Leisure & Entertainment > Games > Computer Games (1.00)
Information Technology (1.00)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning > Generative AI (0.89)

Add feedback

High dimensional precision medicine from patient-derived xenografts

Rashid, Naim U., Luckett, Daniel J., Chen, Jingxiang, Lawson, Michael T., Wang, Longshaokan, Zhang, Yunshu, Laber, Eric B., Liu, Yufeng, Yeh, Jen Jen, Zeng, Donglin, Kosorok, Michael R.

arXiv.org Machine LearningDec-13-2019

The complexity of human cancer often results in significant heterogeneity in response to treatment. Precision medicine offers potential to improve patient outcomes by leveraging this heterogeneity. Individualized treatment rules (ITRs) formalize precision medicine as maps from the patient covariate space into the space of allowable treatments. The optimal ITR is that which maximizes the mean of a clinical outcome in a population of interest. Patient-derived xenograft (PDX) studies permit the evaluation of multiple treatments within a single tumor and thus are ideally suited for estimating optimal ITRs. PDX data are characterized by correlated outcomes, a high-dimensional feature space, and a large number of treatments. Existing methods for estimating optimal ITRs do not take advantage of the unique structure of PDX data or handle the associated challenges well. In this paper, we explore machine learning methods for estimating optimal ITRs from PDX data. We analyze data from a large PDX study to identify biomarkers that are informative for developing personalized treatment recommendations in multiple cancers. We estimate optimal ITRs using regression-based approaches such as Q-learning and direct search methods such as outcome weighted learning. Finally, we implement a superlearner approach to combine a set of estimated ITRs and show that the resulting ITR performs better than any of the input ITRs, mitigating uncertainty regarding user choice of any particular ITR estimation methodology. Our results indicate that PDX data are a valuable resource for developing individualized treatment strategies in oncology.

cancer type, itr, pdx line, (15 more...)

arXiv.org Machine Learning

1912.06667

Country:

North America > United States > North Carolina > Wake County > Raleigh (0.04)
North America > United States > North Carolina > Orange County > Chapel Hill (0.04)
North America > United States > Michigan (0.04)

Genre:

Research Report > New Finding (1.00)
Research Report > Experimental Study (0.90)

Industry:

Health & Medicine > Therapeutic Area > Oncology (1.00)
Health & Medicine > Pharmaceuticals & Biotechnology (1.00)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (0.69)
Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (0.57)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.46)

Add feedback

Provably Efficient Reinforcement Learning with Aggregated States

Dong, Shi, Van Roy, Benjamin, Zhou, Zhengyuan

arXiv.org Machine LearningDec-13-2019

We establish that an optimistic variant of Q-learning applied to a finite-horizon episodic Markov decision process with an aggregated state representation incurs regret $\tilde{\mathcal{O}}(\sqrt{H^5 M K} + \epsilon HK)$, where $H$ is the horizon, $M$ is the number of aggregate states, $K$ is the number of episodes, and $\epsilon$ is the largest difference between any pair of optimal state-action values associated with a common aggregate state. Notably, this regret bound does not depend on the number of states or actions. To the best of our knowledge, this is the first such result pertaining to a reinforcement learning algorithm applied with nontrivial value function approximation without any restrictions on the Markov decision process.

aggregate state, aggregation, algorithm, (13 more...)

arXiv.org Machine Learning

1912.06366

Country:

Asia > Middle East > Jordan (0.04)
North America > United States > New York (0.04)

Genre: Research Report (0.50)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models > Undirected Networks > Markov Models (0.54)

Add feedback

Breakthrough Research In Reinforcement Learning From 2019

#artificialintelligenceDec-12-2019, 23:30:00 GMT

Reinforcement learning (RL) continues to be less valuable for business applications than supervised learning, and even unsupervised learning. It is successfully applied only in areas where huge amounts of simulated data can be generated, like robotics and games. However, many experts recognize RL as a promising path towards Artificial General Intelligence (AGI), or true intelligence. Thus, research teams from top institutions and tech leaders are seeking ways to make RL algorithms more sample-efficient and stable. We've selected and summarized 10 research papers that we think are representative of the latest research trends in reinforcement learning. The papers explore, among others, the interaction of multiple agents, off-policy learning, and more efficient exploration.

agent, algorithm, sample efficiency, (14 more...)

#artificialintelligence

Genre: Research Report > New Finding (0.47)

Industry: Education (0.47)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)

Add feedback

Introduction to Reinforcement Learning

#artificialintelligenceDec-12-2019, 10:58:39 GMT

Edward observed his cats as they tried to escape from home-made puzzle boxes. Puzzles were simple, all cats had to do was pull some string or push a poll and they were out. When first encountered with a puzzle cats took a long time to solve it. However, when faced with the same or similar problem, cats were able to solve it and escape much faster. Responses that produce a satisfying effect in a particular situation become more likely to occur again in that situation, and responses that produce a discomforting effect become less likely to occur again in that situation.

agent, learning, reinforcement, (14 more...)

#artificialintelligence

Technology: Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (0.73)

Add feedback

Training multi-agent AI systems to solve complex tasks through cooperation

#artificialintelligenceDec-12-2019, 04:02:56 GMT

A novel approach to cooperative multi-agent reinforcement learning (RL) that assigns tasks to individual agents within a group, thereby improving the entire group's ability to collaborate. We tested this method in the real-time strategy game StarCraft: Brood War, and found that our RL-trained model significantly outperformed computer-controlled players that relied on carefully tuned rule-based baselines. Perhaps most important, these gains carried over to matches with significantly larger armies than what we included in our training scenarios. We're releasing the source code for this approach on our TorchCraftAI GitHub repository, and detailing our results, which indicate that treating collaborative multi-agent RL as a dynamic assignment problem can lead to groups of agents that are better at generalizing to more complex situations. Our approach focuses on multi-agent collaborative (MAC) problems where agents have to carry out multiple intermediate tasks in order to accomplish a larger one.

agent, solve complex task, training multi-agent ai system, (8 more...)

#artificialintelligence

Genre: Research Report (0.57)

Industry: Leisure & Entertainment > Games > Computer Games (0.59)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning > Agents > Agent Societies (0.60)

Add feedback

Control-Tutored Reinforcement Learning

De Lellis, Francesco, Auletta, Fabrizia, Russo, Giovanni, De Lellis, Piero, di Bernardo, Mario

arXiv.org Machine LearningDec-12-2019

We introduce a control-tutored reinforcement learning (CTRL) algorithm. The idea is to enhance tabular learning algorithms so as to improve the exploration of the state-space, and substantially reduce learning times by leveraging some limited knowledge of the plant encoded into a tutoring model-based control strategy. We illustrate the benefits of our novel approach and its effectiveness by using the problem of controlling one or more agents to herd and contain within a goal region a set of target free-roving agents in the plane.

agent, algorithm, herder, (13 more...)

arXiv.org Machine Learning

1912.06085

Country:

Europe > Italy (0.04)
Europe > Ireland > Leinster > County Dublin > Dublin (0.04)

Genre: Research Report (0.84)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)

Add feedback

Provably Efficient Exploration in Policy Optimization

Cai, Qi, Yang, Zhuoran, Jin, Chi, Wang, Zhaoran

arXiv.org Machine LearningDec-12-2019

While policy-based reinforcement learning (RL) achieves tremendous successes in practice, it is significantly less understood in theory, especially compared with value-based RL. In particular, it remains elusive how to design a provably efficient policy optimization algorithm that incorporates exploration. To bridge such a gap, this paper proposes an Optimistic variant of the Proximal Policy Optimization algorithm (OPPO), which follows an "optimistic version" of the policy gradient direction. This paper proves that, in the problem of episodic Markov decision process with linear function approximation, unknown transition, and adversarial reward with full-information feedback, OPPO achieves $\tilde{O}(\sqrt{d^3 H^3 T})$ regret. Here $d$ is the feature dimension, $H$ is the episode horizon, and $T$ is the total number of steps. To the best of our knowledge, OPPO is the first provably efficient policy optimization algorithm that explores.

algorithm 1, arxiv preprint arxiv, reward function, (10 more...)

arXiv.org Machine Learning

1912.0583

Country: Asia > Middle East > Jordan (0.04)

Genre: Research Report (0.40)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models > Undirected Networks > Markov Models (0.34)

Add feedback

Recruitment-imitation Mechanism for Evolutionary Reinforcement Learning

Lü, Shuai, Han, Shuai, Zhou, Wenbo, Zhang, Junwei

arXiv.org Artificial IntelligenceDec-12-2019

Reinforcement learning, evolutionary algorithms and imitation learning are three principal methods to deal with continuous control tasks. Reinforcement learning is sample efficient, yet sensitive to hyper-parameters setting and needs efficient exploration; Evolutionary algorithms are stable, but with low sample efficiency; Imitation learning is both sample efficient and stable, however it requires the guidance of expert data. In this paper, we propose Recruitment-imitation Mechanism (RIM) for evolutionary reinforcement learning, a scalable framework that combines advantages of the three methods mentioned above. The core of this framework is a dual-actors and single critic reinforcement learning agent. This agent can recruit high-fitness actors from the population of evolutionary algorithms, which instructs itself to learn from experience replay buffer. At the same time, low-fitness actors in the evolutionary population can imitate behavior patterns of the reinforcement learning agent and improve their adaptability. Reinforcement and imitation learners in this framework can be replaced with any off-policy actor-critic reinforcement learner or data-driven imitation learner. We evaluate RIM on a series of benchmarks for continuous control tasks in Mujoco. The experimental results show that RIM outperforms prior evolutionary or reinforcement learning methods. The performance of RIM's components is significantly better than components of previous evolutionary reinforcement learning algorithm, and the recruitment using soft update enables reinforcement learning agent to learn faster than that using hard update.

agent, algorithm, reinforcement, (15 more...)

arXiv.org Artificial Intelligence

1912.0631

Country:

Asia > Middle East > Jordan (0.04)
Asia > China > Jilin Province (0.04)

Genre: Research Report (1.00)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)

Add feedback

Learning To Reach Goals Without Reinforcement Learning

Ghosh, Dibya, Gupta, Abhishek, Fu, Justin, Reddy, Ashwin, Devin, Coline, Eysenbach, Benjamin, Levine, Sergey

arXiv.org Artificial IntelligenceDec-12-2019

L EARNING TO R EACH G OALS WITHOUT R EINFORCEMENTL EARNING Dibya Ghosh* 1, Abhishek Gupta* 1, Justin Fu 1, Ashwin Reddy 1, Coline Devin 1 Benjamin Eysenbach 2 Sergey Levine 1 1 University of California Berkeley 2 Carnegie Mellon University A BSTRACT Imitation learning algorithms provide a simple and straightforward approach for training control policies via supervised learning. By maximizing the likelihood of good actions provided by an expert demonstrator, supervised imitation learning can produce effective policies without the algorithmic complexities and optimization challenges of reinforcement learning, at the cost of requiring an expert demonstrator to provide the demonstrations. In this paper, we ask: can we take insights from imitation learning to design algorithms that can effectively acquire optimal policies from scratch without any expert demonstrations? The key observation that makes this possible is that, in the multi-task setting, trajectories that are generated by a suboptimal policy can still serve as optimal examples for other tasks. In particular, when tasks correspond to different goals, every trajectory is a successful demonstration for the goal state that it actually reaches. We propose a simple algorithm for learning goal-reaching behaviors without any demonstrations, complicated user-provided reward functions, or complex reinforcement learning methods. Our method simply maximizes the likelihood of actions the agent actually took in its own previous rollouts, conditioned on the goal being the state that it actually reached. Although related variants of this approach have been proposed previously in imitation learning with demonstrations, we show how this approach can effectively learn goal-reaching policies from scratch. We present a theoretical result linking self-supervised imitation learning and reinforcement learning, and empirical results showing that it performs competitively with more complex reinforcement learning methods on a range of challenging goal reaching problems, while yielding advantages in terms of stability and use of offline data. 1 I NTRODUCTION Reinforcement learning (RL) algorithms hold the promise of providing a broadly-applicable tool for automating control, and the combination of high-capacity deep neural network models with RL extends their applicability to settings with complex observations and that require intricate policies. However, RL with function approximation, including deep RL, presents a challenging optimization problem. Despite years of research, current deep RL methods are far from a turnkey solution: most popular methods lack convergence guarantees (Baird, 1995; Tsitsiklis & V an Roy, 1997) or require prohibitive numbers of samples (Schulman et al., 2015; Lillicrap et al., 2015).

algorithm, demonstration, trajectory, (15 more...)

arXiv.org Artificial Intelligence

1912.06088

Country:

North America > United States > California > Alameda County > Berkeley (0.24)
North America > United States > California > San Francisco County > San Francisco (0.14)
North America > United States > Florida > Broward County > Fort Lauderdale (0.04)
Europe > France > Hauts-de-France > Nord > Lille (0.04)

Genre: Research Report > New Finding (0.34)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.48)

Add feedback