Benjamin Van Roy
Scalable Coordinated Exploration in Concurrent Reinforcement Learning
Maria Dimakopoulou, Ian Osband, Benjamin Van Roy
We consider a team of reinforcement learning agents that concurrently operate in a common environment, and we develop an approach to efficient coordinated exploration that is suitable for problems of practical scale. Our approach builds on seed sampling [1] and randomized value function learning [11]. We demonstrate that, for simple tabular contexts, the approach is competitive with previously proposed tabular model learning methods [1]. With a higher-dimensional problem and a neural network value function representation, the approach learns quickly with far fewer agents than alternative exploration schemes.
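As a rough, hypothetical sketch (not the paper's implementation) of the seed-sampling idea in a tabular setting: each concurrent agent holds its own random seed, which fixes a randomized prior and a deterministic perturbation of every shared observation, so agents fit diverse value functions to common data and explore in a coordinated yet non-redundant way. All names and hyperparameters below are illustrative.

# Hypothetical sketch: seeded randomized value functions for concurrent agents.
import numpy as np

class SeededAgent:
    def __init__(self, n_states, n_actions, seed, prior_scale=1.0, noise_scale=1.0):
        self.seed = seed
        rng = np.random.default_rng(seed)
        # Seeded random prior: a different initial Q-function per agent.
        self.Q = prior_scale * rng.standard_normal((n_states, n_actions))
        self.noise_scale = noise_scale

    def act(self, state):
        # Greedy with respect to this agent's own randomized value estimate.
        return int(np.argmax(self.Q[state]))

    def fit(self, shared_buffer, lr=0.1, gamma=0.95, sweeps=20):
        # The perturbation of observation i is a deterministic function of
        # (agent seed, i), so it stays fixed across sweeps -- the "seed" part --
        # while different agents apply different noise to the same shared data.
        for _ in range(sweeps):
            for i, (s, a, r, s_next) in enumerate(shared_buffer):
                z = self.noise_scale * np.random.default_rng((self.seed, i)).standard_normal()
                target = r + z + gamma * np.max(self.Q[s_next])
                self.Q[s, a] += lr * (target - self.Q[s, a])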
Information-Theoretic Confidence Bounds for Reinforcement Learning
Xiuyuan Lu, Benjamin Van Roy
We integrate information-theoretic concepts into the design and analysis of optimistic algorithms and Thompson sampling. By making a connection between information-theoretic quantities and confidence bounds, we obtain results that relate the per-period performance of the agent with its information gain about the environment, thus explicitly characterizing the exploration-exploitation tradeoff. The resulting cumulative regret bound depends on the agent's uncertainty over the environment and quantifies the value of prior information. We show applicability of this approach to several environments, including linear bandits, tabular MDPs, and factored MDPs. These examples demonstrate the potential of a general information-theoretic approach for the design and analysis of reinforcement learning algorithms.
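The shape of bound in question can be illustrated with the standard information-ratio argument for Thompson sampling in a bandit (notation here is illustrative, not the paper's): if the per-period information ratio is uniformly bounded, cumulative regret is bounded in terms of the entropy of the optimal action,
\[
\Gamma_t \;=\; \frac{\big(\mathbb{E}_t[\Delta_t]\big)^2}{I_t\big(A^{*};\,(A_t, Y_{t,A_t})\big)} \;\le\; \overline{\Gamma}\ \ \text{for all } t
\quad \Longrightarrow \quad
\mathbb{E}\big[\mathrm{Regret}(T)\big] \;\le\; \sqrt{\overline{\Gamma}\, H(A^{*})\, T},
\]
where \(\Delta_t\) is the expected per-period regret, \(I_t\) is mutual information conditioned on the history, and \(H(A^{*})\) is the entropy of the optimal action; stronger prior information means smaller \(H(A^{*})\) and hence a tighter bound, which is one sense in which such bounds quantify the value of prior information.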
Deep Exploration via Bootstrapped DQN
Ian Osband, Charles Blundell, Alexander Pritzel, Benjamin Van Roy
Efficient exploration remains a major challenge for reinforcement learning (RL). Common dithering strategies for exploration, such as ε-greedy, do not carry out temporally-extended (or deep) exploration; this can lead to exponentially larger data requirements. However, most algorithms for statistically efficient RL are not computationally tractable in complex environments. Randomized value functions offer a promising approach to efficient exploration with generalization, but existing algorithms are not compatible with nonlinearly parameterized value functions. As a first step towards addressing such contexts we develop bootstrapped DQN. We demonstrate that bootstrapped DQN can combine deep exploration with deep neural networks for exponentially faster learning than any dithering strategy. In the Arcade Learning Environment bootstrapped DQN substantially improves learning speed and cumulative performance across most games.
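A minimal sketch of the bootstrapped-heads mechanism described above, assuming PyTorch and hypothetical names (this is an illustration of the idea, not the paper's code): a shared torso feeds K independent Q-heads, each transition carries a Bernoulli bootstrap mask saying which heads may train on it, and one head is sampled per episode and followed greedily, which produces temporally-extended exploration.

# Illustrative sketch of bootstrapped Q-heads with per-transition masks.
import random
import numpy as np
import torch
import torch.nn as nn

class BootstrappedQNet(nn.Module):
    def __init__(self, obs_dim, n_actions, n_heads=10, hidden=64):
        super().__init__()
        self.torso = nn.Sequential(nn.Linear(obs_dim, hidden), nn.ReLU())
        self.heads = nn.ModuleList(nn.Linear(hidden, n_actions) for _ in range(n_heads))

    def forward(self, obs, head):
        return self.heads[head](self.torso(obs))

def train_step(net, optimizer, batch, gamma=0.99):
    # batch: list of (obs, action, reward, next_obs, done, mask) tuples, where
    # mask is a 0/1 vector over heads drawn once when the transition is stored.
    obs, act, rew, nxt, done, mask = map(
        lambda x: torch.as_tensor(np.array(x), dtype=torch.float32), zip(*batch))
    loss = torch.tensor(0.0)
    for k in range(len(net.heads)):
        q = net(obs, k).gather(1, act.long().unsqueeze(1)).squeeze(1)
        with torch.no_grad():
            target = rew + gamma * (1 - done) * net(nxt, k).max(dim=1).values
        # Each head trains only on the transitions its bootstrap mask selects.
        loss = loss + (mask[:, k] * (q - target) ** 2).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

def episode_policy(net, n_heads):
    # Sample one head at the start of each episode and act greedily with it.
    head = random.randrange(n_heads)
    def policy(obs):
        with torch.no_grad():
            q = net(torch.as_tensor(obs, dtype=torch.float32).unsqueeze(0), head)
        return int(q.argmax(dim=1).item())
    return policy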
Ensemble Sampling
Xiuyuan Lu, Benjamin Van Roy
Thompson sampling has emerged as an effective heuristic for a broad range of online decision problems. In its basic form, the algorithm requires computing and sampling from a posterior distribution over models, which is tractable only for simple special cases. This paper develops ensemble sampling, which aims to approximate Thompson sampling while maintaining tractability even in the face of complex models such as neural networks. Ensemble sampling dramatically expands on the range of applications for which Thompson sampling is viable. We establish a theoretical basis that supports the approach and present computational results that offer further insight.
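A rough numpy sketch of ensemble sampling specialized to a linear bandit, with illustrative names and hyperparameters (see the paper for the precise algorithm): maintain M models initialized from independent prior draws, fit each to the data with its own additive reward noise so the ensemble stays spread out like posterior samples, and at each period sample one member uniformly and act greedily under it.

# Illustrative numpy sketch of ensemble sampling for a linear bandit.
import numpy as np

class EnsembleSamplingLinearBandit:
    def __init__(self, dim, n_models=30, prior_var=1.0, noise_var=0.1, seed=0):
        self.rng = np.random.default_rng(seed)
        self.noise_var = noise_var
        self.n_models = n_models
        # Each ensemble member starts from an independent draw from the prior.
        prior_draws = self.rng.normal(0.0, np.sqrt(prior_var), size=(n_models, dim))
        self.theta = prior_draws.copy()
        # Sufficient statistics of each member's regularized least-squares fit.
        self.A = np.stack([np.eye(dim) / prior_var for _ in range(n_models)])
        self.b = prior_draws / prior_var

    def act(self, actions):
        # Sample one member uniformly, then act greedily under it -- an
        # approximation to drawing a single model from the posterior.
        m = self.rng.integers(self.n_models)
        return int(np.argmax(actions @ self.theta[m]))

    def update(self, x, reward):
        for m in range(self.n_models):
            # Each member sees the reward perturbed by its own noise draw.
            perturbed = reward + self.rng.normal(0.0, np.sqrt(self.noise_var))
            self.A[m] += np.outer(x, x) / self.noise_var
            self.b[m] += perturbed * x / self.noise_var
            self.theta[m] = np.linalg.solve(self.A[m], self.b[m])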