 deep exploration



Deep Exploration via Bootstrapped DQN

Neural Information Processing Systems

Efficient exploration remains a major challenge for reinforcement learning (RL). Common dithering strategies for exploration, such as epsilon-greedy, do not carry out temporally-extended (or deep) exploration; this can lead to exponentially larger data requirements. However, most algorithms for statistically efficient RL are not computationally tractable in complex environments. Randomized value functions offer a promising approach to efficient exploration with generalization, but existing algorithms are not compatible with nonlinearly parameterized value functions. As a first step towards addressing such contexts we develop bootstrapped DQN. We demonstrate that bootstrapped DQN can combine deep exploration with deep neural networks for exponentially faster learning than any dithering strategy. In the Arcade Learning Environment bootstrapped DQN substantially improves learning speed and cumulative performance across most games.
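
The contrast between dithering and deep exploration can be made concrete with a small tabular sketch. It is not the paper's implementation: the abstract gives no algorithmic details, so the choices below are assumptions made for illustration (Q-tables standing in for network heads, one head sampled per episode, and a Bernoulli mask deciding which heads see each transition). The names `BootstrappedHeads`, `start_episode`, and `mask_prob` are hypothetical.

```python
import random


def epsilon_greedy_action(q_values, epsilon, rng):
    """Dithering: at every step, act randomly with probability epsilon."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda a: q_values[a])


class BootstrappedHeads:
    """K independent Q-tables (assumed stand-ins for K value-function heads)."""

    def __init__(self, n_states, n_actions, n_heads, rng):
        self.rng = rng
        # Random initial values make the heads disagree before any data arrives.
        self.q = [
            [[rng.gauss(0.0, 1.0) for _ in range(n_actions)] for _ in range(n_states)]
            for _ in range(n_heads)
        ]
        self.active = 0

    def start_episode(self):
        # Assumption: one head is sampled uniformly and kept for the whole episode,
        # so exploration is consistent over many steps instead of per-step noise.
        self.active = self.rng.randrange(len(self.q))
        return self.active

    def act(self, state):
        row = self.q[self.active][state]
        return max(range(len(row)), key=lambda a: row[a])

    def update(self, s, a, r, s_next, done, alpha=0.1, gamma=0.99, mask_prob=0.5):
        # Assumption: each head trains on a random subset of transitions
        # (a bootstrap-style mask), using a standard Q-learning target.
        for head in self.q:
            if self.rng.random() < mask_prob:
                target = r if done else r + gamma * max(head[s_next])
                head[s][a] += alpha * (target - head[s][a])
```

Within an episode, `act` is deterministic given the sampled head, whereas `epsilon_greedy_action` re-randomizes at every step; that difference is what "temporally-extended" exploration refers to in this sketch.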


Reviews: Deep Exploration via Bootstrapped DQN

Neural Information Processing Systems

First, it would take months to reproduce these experiments (quite apart from the hardware requirements). Second, with such complicated algorithms it's hard to know what exactly is leading to the improvement. For this reason I find this kind of paper a little unscientific, but maybe this is how things have to be. I wonder, do the authors plan to release their code? Overall I think this is an interesting idea, but the authors have not convinced me that this is a principled approach.


Epistemic Monte Carlo Tree Search

arXiv.org Artificial Intelligence

The AlphaZero/MuZero (A/MZ) family of algorithms has achieved remarkable success across various challenging domains by integrating Monte Carlo Tree Search (MCTS) with learned models. Learned models introduce epistemic uncertainty, which is caused by learning from limited data and is useful for exploration in sparse-reward environments. However, MCTS does not account for the propagation of this uncertainty. To address this, we introduce Epistemic MCTS (EMCTS): a theoretically motivated approach to account for the epistemic uncertainty in search and harness the search for deep exploration. In the challenging sparse-reward task of writing code in the assembly language subleq, AZ paired with our method achieves significantly higher sample efficiency than baseline AZ. Search with EMCTS solves variations of the commonly used hard-exploration benchmark Deep Sea, which baseline A/MZ are practically unable to solve, much faster than an otherwise equivalent method that does not use search for uncertainty estimation, demonstrating significant benefits from search for epistemic uncertainty estimation.


Deep Exploration via Bootstrapped DQN Ian Osband

Neural Information Processing Systems

Efficient exploration remains a major challenge for reinforcement learning (RL). Common dithering strategies for exploration, such as ε-greedy, do not carry out temporally-extended (or deep) exploration; this can lead to exponentially larger data requirements. However, most algorithms for statistically efficient RL are not computationally tractable in complex environments. Randomized value functions offer a promising approach to efficient exploration with generalization, but existing algorithms are not compatible with nonlinearly parameterized value functions. As a first step towards addressing such contexts we develop bootstrapped DQN. We demonstrate that bootstrapped DQN can combine deep exploration with deep neural networks for exponentially faster learning than any dithering strategy. In the Arcade Learning Environment bootstrapped DQN substantially improves learning speed and cumulative performance across most games.


Bag of Policies for Distributional Deep Exploration

arXiv.org Artificial Intelligence

Efficient exploration in complex environments remains a major challenge for reinforcement learning (RL). Compared to previous Thompson sampling-inspired mechanisms that enable temporally extended exploration, i.e., deep exploration, we focus on deep exploration in distributional RL. We develop here a general purpose approach, Bag of Policies (BoP), that can be built on top of any return distribution estimator by maintaining a population of its copies. BoP consists of an ensemble of multiple heads that are updated independently. During training, each episode is controlled by only one of the heads and the collected state-action pairs are used to update all heads off-policy, leading to distinct learning signals for each head which diversify ...

Distributional RL (DiRL) has rapidly established its place among reinforcement learning (RL) algorithms Bellemare et al. [2017] as a powerful improvement over non-distributional value-based counterparts Lyle et al. [2019]. In DiRL, the agent does not learn a single summary statistic of the return for each state-action pair, but instead learns the whole return distribution. The agent's behaviour is being evaluated for multiple possible consequences which in turn affect the policy update. While this does lead to more stable learning and better performance Lyle et al. [2019], it does not itself change the way actions are selected; as distributional extensions to value-based RL, in C51 Bellemare et al. [2017], QR-DQN Dabney et al. [2018b] the agent still takes actions according to the mean of the estimated return distributions in each state-action pair.
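
As a rough illustration of the point that distributional agents still act on the mean of the estimated return distribution, the sketch below picks the greedy action from categorical return distributions over a fixed set of atoms. The categorical representation and the helper names `expected_return` and `greedy_action` are assumptions for illustration, not code from the paper.

```python
def expected_return(atoms, probs):
    """Mean of a categorical return distribution with support `atoms`."""
    return sum(z * p for z, p in zip(atoms, probs))


def greedy_action(atoms, dist_per_action):
    """Pick the action whose return distribution has the highest mean."""
    means = [expected_return(atoms, probs) for probs in dist_per_action]
    return max(range(len(means)), key=lambda a: means[a])


# Hypothetical example: three return atoms, two actions.
atoms = [-1.0, 0.0, 1.0]
dists = [
    [0.0, 1.0, 0.0],    # action 0: a certain return of 0
    [0.45, 0.0, 0.55],  # action 1: risky, but with a slightly higher mean
]
chosen = greedy_action(atoms, dists)
```

Only the means enter the decision here; the shape of each distribution is ignored at action-selection time, so the rule adds no exploration of its own.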


Deep Exploration for Recommendation Systems

arXiv.org Artificial Intelligence

We investigate the design of recommendation systems that can efficiently learn from sparse and delayed feedback. Deep Exploration can play an important role in such contexts, enabling a recommendation system to much more quickly assess a user's needs and personalize service. We design an algorithm based on Thompson Sampling that carries out Deep Exploration. We demonstrate through simulations that the algorithm can substantially amplify the rate of positive feedback relative to common recommendation system designs in a scalable fashion. These results demonstrate promise that we hope will inspire engineering of production recommendation systems that leverage Deep Exploration.


Langevin DQN

arXiv.org Artificial Intelligence

Algorithms that tackle deep exploration -- an important challenge in reinforcement learning -- have relied on epistemic uncertainty representation through ensembles or other hypermodels, exploration bonuses, or visitation count distributions. An open question is whether deep exploration can be achieved by an incremental reinforcement learning algorithm that tracks a single point estimate, without additional complexity required to account for epistemic uncertainty. We answer this question in the affirmative. In particular, we develop Langevin DQN, a variation of DQN that differs only in perturbing parameter updates with Gaussian noise, and demonstrate through a computational study that the algorithm achieves deep exploration. We also provide an intuition for why Langevin DQN performs deep exploration.
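
The idea of perturbing parameter updates with Gaussian noise can be sketched as follows. This is a minimal illustration under stated assumptions: a linear value function, a semi-gradient TD step, and a free `noise_scale` parameter. The abstract does not specify how the noise is scaled, and the function name `langevin_td_step` is hypothetical.

```python
import random


def langevin_td_step(theta, features, td_error, step_size, noise_scale, rng):
    """One semi-gradient TD step on linear weights, plus Gaussian noise.

    With noise_scale == 0 this reduces to an ordinary TD update; the added
    noise term is the only difference (noise scaling is an assumption here).
    """
    return [
        w + step_size * td_error * x + noise_scale * rng.gauss(0.0, 1.0)
        for w, x in zip(theta, features)
    ]


rng = random.Random(0)
plain = langevin_td_step([0.0, 0.0], [1.0, 2.0], 0.5, 0.1, 0.0, rng)
noisy = langevin_td_step([0.0, 0.0], [1.0, 2.0], 0.5, 0.1, 0.01, rng)
```

The agent still tracks a single point estimate `theta`; the injected noise is what keeps the estimate moving in this sketch.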


ISL: Optimal Policy Learning With Optimal Exploration-Exploitation Trade-Off

arXiv.org Artificial Intelligence

Traditionally, off-policy learning algorithms (such as Q-learning) and exploration schemes have been derived separately. Often, the exploration-exploitation dilemma is addressed through heuristics. In this article we show that both the learning equations and the exploration-exploitation strategy can be derived in tandem as the solution to a unique and well-posed optimization problem whose minimization leads to the optimal value function. We present a new algorithm following this idea. The algorithm is of the gradient type (and therefore has good convergence properties even when used in conjunction with function approximators such as neural networks); it is off-policy; and it specifies both the update equations and the strategy to address the exploration-exploitation dilemma. To the best of our knowledge, this is the first algorithm that has these properties.