AITopics | Reinforcement Learning

Collaborating Authors

Reinforcement Learning

"Reinforcement learning is learning what to do – how to map situations to actions – so as to maximize a numerical reward signal. The learner is not told which actions to take, as in most forms of machine learning, but instead must discover which actions yield the most reward by trying them."
– Sutton, Richard S. and Andrew G. Barto. Reinforcement Learning: An Introduction. (1.1). MIT Press, Cambridge, MA, 1998.

News Overviews Instructional Materials AI-Alerts Classics

Communication-Efficient Distributed Reinforcement Learning

Chen, Tianyi, Zhang, Kaiqing, Giannakis, Georgios B., Başar, Tamer

arXiv.org Machine LearningDec-7-2018

This paper studies the distributed reinforcement learning (DRL) problem involving a central controller and a group of learners. Two DRL settings that find broad applications are considered: multi-agent reinforcement learning (RL) and parallel RL. In both settings, frequent information exchange between the learners and the controller are required. However, for many distributed systems, e.g., parallel machines for training deep RL algorithms, and multi-robot systems for learning the optimal coordination strategies, the overhead caused by frequent communication is not negligible and becomes the bottleneck of the overall performance. To overcome this challenge, we develop a new policy gradient method that is amenable to efficient implementation in such communication-constrained settings. By adaptively skipping the policy gradient communication, our method can reduce the communication overhead without degrading the learning accuracy. Analytically, we can establish that i) the convergence rate of our algorithm is the same as the vanilla policy gradient for the DRL tasks; and, ii) if the distributed computing units are heterogeneous in terms of their reward functions and initial state distributions, the number of communication rounds needed to achieve a targeted learning accuracy is reduced. Numerical experiments on a popular multi-agent RL benchmark corroborate the significant communication reduction of our algorithm compared to the alternatives.

artificial intelligence, machine learning, reinforcement learning, (14 more...)

arXiv.org Machine Learning

1812.03239

Country:

Europe (1.00)
North America > United States > Minnesota (0.28)

Genre: Research Report (0.70)

Industry: Leisure & Entertainment (0.67)

Technology:

Information Technology > Artificial Intelligence > Robots (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning > Agents (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)

Add feedback

Taking the Scenic Route: Automatic Exploration for Videogames

Zhan, Zeping, Aytemiz, Batu, Smith, Adam M.

arXiv.org Artificial IntelligenceDec-7-2018

Machine playtesting tools and game moment search engines require exposure to the diversity of a game's state space if they are to report on or index the most interesting moments of possible play. Meanwhile, mobile app distribution services would like to quickly determine if a freshly-uploaded game is fit to be published. Having access to a semantic map of reachable states in the game would enable efficient inference in these applications. However, human gameplay data is expensive to acquire relative to the coverage of a game that it provides. We show that off-the-shelf automatic exploration strategies can explore with an effectiveness comparable to human gameplay on the same timescale. We contribute generic methods for quantifying exploration quality as a function of time and demonstrate our metric on several elementary techniques and human players on a collection of commercial games sampled from multiple game platforms (from Atari 2600 to Nintendo 64). Emphasizing the diversity of states reached and the semantic map extracted, this work makes productive contrast with the focus on finding a behavior policy or optimizing game score used in most automatic game playing research.

information retrieval, machine learning, reinforcement learning, (19 more...)

arXiv.org Artificial Intelligence

1812.03125

Country: North America > United States > California (0.28)

Genre: Research Report (0.50)

Industry: Leisure & Entertainment > Games > Computer Games (1.00)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (0.48)
Information Technology > Artificial Intelligence > Natural Language > Information Retrieval (0.34)

Add feedback

A general reinforcement learning algorithm that masters chess, shogi, and Go through self-play

ScienceDec-6-2018, 22:42:10 GMT

Computers can beat humans at increasingly complex games, including chess and Go. However, these programs are typically constructed for a particular game, exploiting its properties, such as the symmetries of the board on which it is played. Silver et al. developed a program called AlphaZero, which taught itself to play Go, chess, and shogi (a Japanese version of chess) (see the Editorial, and the Perspective by Campbell). AlphaZero managed to beat state-of-the-art programs specializing in these three games. The ability of AlphaZero to adapt to various game rules is a notable step toward achieving a general game-playing system.

alphazero, machine learning, reinforcement learning, (16 more...)

Science

Industry: Leisure & Entertainment > Games > Chess (1.00)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (0.88)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.74)
Information Technology > Artificial Intelligence > Representation & Reasoning > Search (0.70)

Add feedback

Mastering board games

ScienceDec-6-2018, 22:37:17 GMT

From the earliest days of the computer era, games have been considered important vehicles for research in artificial intelligence (AI) (1). Game environments simplify many aspects of real-world problems yet retain sufficient complexity to challenge humans and machines alike. Most programs for playing classic board games have been largely human-engineered (2, 3). Sophisticated search methods, complex evaluation functions, and a variety of game-specific tricks have allowed programs to surpass the best human players. More recently, a learning approach achieved superhuman performance in the hardest of the classic games, Go (4), but was specific for this game and took advantage of human-derived game–specific knowledge.

alphazero, machine learning, reinforcement learning, (15 more...)

Science

Industry:

Leisure & Entertainment > Games > Computer Games (0.52)
Leisure & Entertainment > Games > Chess (0.37)

Technology:

Information Technology > Artificial Intelligence > Games (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (0.81)

Add feedback

Top-K Off-Policy Correction for a REINFORCE Recommender System

Chen, Minmin, Beutel, Alex, Covington, Paul, Jain, Sagar, Belletti, Francois, Chi, Ed

arXiv.org Machine LearningDec-6-2018

Industrial recommender systems deal with extremely large action spaces -- many millions of items to recommend. Moreover, they need to serve billions of users, who are unique at any point in time, making a complex user state space. Luckily, huge quantities of logged implicit feedback (e.g., user clicks, dwell time) are available for learning. Learning from the logged feedback is however subject to biases caused by only observing feedback on recommendations selected by the previous versions of the recommender. In this work, we present a general recipe of addressing such biases in a production top-K recommender system at Youtube, built with a policy-gradient-based algorithm, i.e. REINFORCE. The contributions of the paper are: (1) scaling REINFORCE to a production recommender system with an action space on the orders of millions; (2) applying off-policy correction to address data biases in learning from logged feedback collected from multiple behavior policies; (3) proposing a novel top-K off-policy correction to account for our policy recommending multiple items at a time; (4) showcasing the value of exploration. We demonstrate the efficacy of our approaches through a series of simulations and multiple live experiments on Youtube.

artificial intelligence, machine learning, reinforcement learning, (18 more...)

arXiv.org Machine Learning

1812.02353

Country: North America > United States (0.28)

Genre: Research Report (0.64)

Industry: Leisure & Entertainment (0.68)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Personal Assistant Systems (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.68)

Add feedback

A new multilayer optical film optimal method based on deep q-learning

Jiang, Anqing, Yoshie, Osamu, Chen, LiangYao

arXiv.org Machine LearningDec-6-2018

Multi-layer optical film has been found to afford important applications in optical communication, optical absorbers, optical filters, etc. Different algorithms of multi-layer optical film design has been developed, as simplex method, colony algorithm, genetic algorithm. These algorithms rapidly promote the design and manufacture of multi-layer films. However, traditional numerical algorithms of converge to local optimum. This means that the algorithms can not give a global optimal solution to the material researchers. In recent years, due to the rapid development to far artificial intelligence, to optimize optical film structure using AI algorithm has become possible. In this paper, we will introduce a new optical film design algorithm based on the deep Q learning. This model can converge the global optimum of the optical thin film structure, this will greatly improve the design efficiency of multi-layer films.

artificial intelligence, machine learning, reinforcement learning, (17 more...)

arXiv.org Machine Learning

1812.02873

Country:

North America > United States (0.14)
Asia > Japan (0.14)
Asia > China (0.14)

Genre: Research Report (0.50)

Industry:

Leisure & Entertainment (0.69)
Media (0.48)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Optimization (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.47)

Add feedback

Off-Policy Deep Reinforcement Learning without Exploration

Fujimoto, Scott, Meger, David, Precup, Doina

arXiv.org Artificial IntelligenceDec-6-2018

Reinforcement learning traditionally considers the task of balancing exploration and exploitation. This work examines batch reinforcement learning--the task of maximally exploiting a given batch of off-policy data, without further data collection. We demonstrate that due to errors introduced by extrapolation, standard off-policy deep reinforcement learning algorithms, such as DQN and DDPG, are only capable of learning with data correlated to their current policy, making them ineffective for most off-policy applications. We introduce a novel class of off-policy algorithms, batch-constrained reinforcement learning, which restricts the action space to force the agent towards behaving on-policy with respect to a subset of the given data. We extend this notion to deep reinforcement learning, and to the best of our knowledge, present the first continuous control deep reinforcement learning algorithm which can learn effectively from uncorrelated off-policy data.

artificial intelligence, machine learning, reinforcement learning, (16 more...)

arXiv.org Artificial Intelligence

1812.029

Country: North America > Canada (0.28)

Genre: Research Report > New Finding (0.46)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)

Add feedback

Finite-Sample Analyses for Fully Decentralized Multi-Agent Reinforcement Learning

Zhang, Kaiqing, Yang, Zhuoran, Liu, Han, Zhang, Tong, Başar, Tamer

arXiv.org Artificial IntelligenceDec-6-2018

Despite the increasing interest in multi-agent reinforcement learning (MARL) in the community, understanding its theoretical foundation has long been recognized as a challenging problem. In this work, we make an attempt towards addressing this problem, by providing finite-sample analyses for fully decentralized MARL. Specifically, we consider two fully decentralized MARL settings, where teams of agents are connected by time-varying communication networks, and either collaborate or compete in a zero-sum game, without the absence of any central controller. These settings cover many conventional MARL settings in the literature. For both settings, we develop batch MARL algorithms that can be implemented in a fully decentralized fashion, and quantify the finite-sample errors of the estimated action-value functions. Our error analyses characterize how the function class, the number of samples within each iteration, and the number of iterations determine the statistical accuracy of the proposed algorithms. Our results, compared to the finite-sample bounds for single-agent RL, identify the involvement of additional error terms caused by decentralized computation, which is inherent in our decentralized MARL setting. To our knowledge, our work appears to be the first finite-sample analyses for MARL, which sheds light on understanding both the sample and computational efficiency of MARL algorithms.

artificial intelligence, machine learning, reinforcement learning, (18 more...)

arXiv.org Artificial Intelligence

1812.02783

Genre: Research Report > New Finding (0.34)

Industry: Leisure & Entertainment > Games (0.67)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning > Agents > Agent Societies (0.67)

Add feedback

Provably Efficient Maximum Entropy Exploration

Hazan, Elad, Kakade, Sham M., Singh, Karan, Van Soest, Abby

arXiv.org Artificial IntelligenceDec-6-2018

Suppose an agent is in a (possibly unknown) Markov decision process (MDP) in the absence of a reward signal, what might we hope that an agent can efficiently learn to do? One natural, intrinsically defined, objective problem is for the agent to learn a policy which induces a distribution over state space that is as uniform as possible, which can be measured in an entropic sense. Despite the corresponding mathematical program being non-convex, our main result provides a provably efficient method (both in terms of sample size and computational complexity) to construct such a maximum-entropy exploratory policy. Key to our algorithmic methodology is utilizing the conditional gradient method (a.k.a. the Frank-Wolfe algorithm) which utilizes an approximate MDP solver.

artificial intelligence, machine learning, reinforcement learning, (18 more...)

arXiv.org Artificial Intelligence

1812.0269

Country: Europe (0.68)

Genre: Research Report (0.40)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.93)
Information Technology > Artificial Intelligence > Representation & Reasoning > Optimization (0.93)
Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (0.71)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Maximum Entropy (0.61)

Add feedback

Deep Reinforcement Learning and the Deadly Triad

van Hasselt, Hado, Doron, Yotam, Strub, Florian, Hessel, Matteo, Sonnerat, Nicolas, Modayil, Joseph

arXiv.org Artificial IntelligenceDec-6-2018

We know from reinforcement learning theory that temporal difference learning can fail in certain cases. Sutton and Barto (2018) identify a deadly triad of function approximation, bootstrapping, and off-policy learning. When these three properties are combined, learning can diverge with the value estimates becoming unbounded. However, several algorithms successfully combine these three properties, which indicates that there is at least a partial gap in our understanding. In this work, we investigate the impact of the deadly triad in practice, in the context of a family of popular deep reinforcement learning models - deep Q-networks trained with experience replay - analysing how the components of this system play a role in the emergence of the deadly triad, and in the agent's performance

artificial intelligence, machine learning, reinforcement learning, (15 more...)

arXiv.org Artificial Intelligence

1812.02648

Country:

North America (0.68)
Europe > United Kingdom > England (0.28)

Genre: Research Report (0.82)

Industry: Leisure & Entertainment > Sports (0.46)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)

Add feedback