exploration and uncertainty
Successor Uncertainties: Exploration and Uncertainty in Temporal Difference Learning
Posterior sampling for reinforcement learning (PSRL) is an effective method for balancing exploration and exploitation in reinforcement learning. Randomised value functions (RVF) can be viewed as a promising approach to scaling PSRL. However, we show that most contemporary algorithms combining RVF with neural network function approximation do not possess the properties which make PSRL effective, and provably fail in sparse reward problems. Moreover, we find that propagation of uncertainty, a property of PSRL previously thought important for exploration, does not preclude this failure. We use these insights to design Successor Uncertainties (SU), a cheap and easy to implement RVF algorithm that retains key properties of PSRL. SU is highly effective on hard tabular exploration benchmarks. Furthermore, on the Atari 2600 domain, it surpasses human performance on 38 of 49 games tested (achieving a median human normalised score of 2.09), and outperforms its closest RVF competitor, Bootstrapped DQN, on 36 of those.
Reviews: Successor Uncertainties: Exploration and Uncertainty in Temporal Difference Learning
This paper proposes using Bayesian linear regression to get a posterior over successor features as a way of representing uncertainty, from which they sample for exploration. I found the characterization of Randomised Policy Iteration to be strange, as it only seems to apply to UBE but not bootstrapped DQN, With bootstrapped DQN, each model in the ensemble is a value function pertaining to a different policy, thus there is no single reference policy. The ensemble is trying to represent a distribution of optimal value functions, rather than value functions for a single reference policy. Proposition 1: In the case of neural networks, and function approximation in general, it is very unlikely that we will get a factored distribution, so this claim does not seem applicable in general. In fact, in general there should be very high correlation between the q-values between nearby states. Is this claim a direct response to UBE? Also the analysis fixes the policy to consider the distribution of value functions, but this seems to not be how posterior sampling is normally considered, but rather only the way UBE considers it.
Successor Uncertainties: Exploration and Uncertainty in Temporal Difference Learning
Posterior sampling for reinforcement learning (PSRL) is an effective method for balancing exploration and exploitation in reinforcement learning. Randomised value functions (RVF) can be viewed as a promising approach to scaling PSRL. However, we show that most contemporary algorithms combining RVF with neural network function approximation do not possess the properties which make PSRL effective, and provably fail in sparse reward problems. Moreover, we find that propagation of uncertainty, a property of PSRL previously thought important for exploration, does not preclude this failure. We use these insights to design Successor Uncertainties (SU), a cheap and easy to implement RVF algorithm that retains key properties of PSRL. SU is highly effective on hard tabular exploration benchmarks.
Successor Uncertainties: Exploration and Uncertainty in Temporal Difference Learning
Janz, David, Hron, Jiri, Mazur, Przemysław, Hofmann, Katja, Hernández-Lobato, José Miguel, Tschiatschek, Sebastian
Posterior sampling for reinforcement learning (PSRL) is an effective method for balancing exploration and exploitation in reinforcement learning. Randomised value functions (RVF) can be viewed as a promising approach to scaling PSRL. However, we show that most contemporary algorithms combining RVF with neural network function approximation do not possess the properties which make PSRL effective, and provably fail in sparse reward problems. Moreover, we find that propagation of uncertainty, a property of PSRL previously thought important for exploration, does not preclude this failure. We use these insights to design Successor Uncertainties (SU), a cheap and easy to implement RVF algorithm that retains key properties of PSRL. SU is highly effective on hard tabular exploration benchmarks.