

 psrl





$\cdots \big[\,\|\widetilde{f}_t(\cdot \mid s,a) - f(\cdot \mid s,a)\|_1\big] + \mathbb{E}_{\rho^{\pi_{t-1}}}\big[\,\|\widetilde{f}_t(\cdot \mid s,a) - f(\cdot \mid s,a)\|_1\big]$, (A.1) where $\Delta(t) := \mathbb{E}_{s \sim \zeta}\big[V \cdots$

Neural Information Processing Systems

From the Posterior Sampling Lemma, we know that if ψ is the distribution of f, then for any σ(H_t)-measurable function g, E[g(f) | H_t] = E[g(f_t) | H_t]. We can further know from the construction of the confidence set (cf. …). This lemma is widely adopted in RL; its proof can be found in various previous works, e.g. … Prior work that shares similarities with ours includes DPI [59] and GPS [31, 39], as dual policy optimization procedures are adopted.
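For readability, the probability-matching property invoked above can be written out in its usual display form; the notation may differ slightly from the paper's appendix.

% Standard statement of the Posterior Sampling Lemma (probability matching):
% if the sampled model f_t is drawn from the posterior of the true model f
% given the history H_t, then f and f_t are exchangeable conditioned on H_t.
\[
  f_t \sim \psi(\cdot \mid H_t)
  \;\Longrightarrow\;
  \mathbb{E}\!\left[\, g(f) \mid H_t \,\right]
  = \mathbb{E}\!\left[\, g(f_t) \mid H_t \,\right]
  \qquad \text{for any } \sigma(H_t)\text{-measurable function } g .
\]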


Conservative Dual Policy Optimization for Efficient Model-Based Reinforcement Learning

Neural Information Processing Systems

Based on the principle of optimism in the face of uncertainty (OFU) [56, 49, 10], OFU-RL achieves global optimality by ensuring that the optimistically biased value is close to the real value in the long run. Based on Thompson Sampling [62], Posterior Sampling RL (PSRL) [57, 42, 43] explores by greedily optimizing the policy in an MDP sampled from the posterior distribution over MDPs.
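To make the PSRL scheme described above concrete, here is a minimal sketch of tabular PSRL, assuming a finite MDP with known rewards and a Dirichlet posterior over transitions; the environment interface and all names are illustrative assumptions rather than code from any of the listed papers.

import numpy as np

# Minimal illustrative sketch of tabular PSRL: Dirichlet posterior over
# transitions, known rewards, value iteration as the planner. The env
# interface (reset/step) is assumed, not taken from the papers above.

def value_iteration(P, R, gamma=0.95, iters=200):
    """Return the greedy policy for the sampled MDP (P: [S,A,S], R: [S,A])."""
    V = np.zeros(R.shape[0])
    for _ in range(iters):
        Q = R + gamma * P @ V          # [S, A]
        V = Q.max(axis=1)
    return Q.argmax(axis=1)

def psrl(env, n_states, n_actions, R, episodes=100, horizon=50):
    alpha = np.ones((n_states, n_actions, n_states))     # Dirichlet prior counts
    for _ in range(episodes):
        # Posterior sampling: draw one MDP from the current posterior ...
        P = np.array([[np.random.dirichlet(alpha[s, a]) for a in range(n_actions)]
                      for s in range(n_states)])
        pi = value_iteration(P, R)                        # ... and plan greedily in it.
        s = env.reset()
        for _ in range(horizon):
            a = pi[s]
            s_next, _, done = env.step(a)
            alpha[s, a, s_next] += 1.0                    # posterior (count) update
            s = s_next
            if done:
                break
    return alpha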


On Efficiency in Hierarchical Reinforcement Learning

Neural Information Processing Systems

While this has been demonstrated empirically over time in a variety of tasks, theoretical results quantifying the benefits of such methods are still few and far between. In this paper, we discuss the kind of structure in a Markov decision process which gives rise to efficient HRL methods.


Successor Uncertainties: Exploration and Uncertainty in Temporal Difference Learning

Neural Information Processing Systems

Posterior sampling for reinforcement learning (PSRL) is an effective method for balancing exploration and exploitation in reinforcement learning. Randomised value functions (RVF) can be viewed as a promising approach to scaling PSRL. However, we show that most contemporary algorithms combining RVF with neural network function approximation do not possess the properties which make PSRL effective, and provably fail in sparse reward problems. Moreover, we find that propagation of uncertainty, a property of PSRL previously thought important for exploration, does not preclude this failure. We use these insights to design Successor Uncertainties (SU), a cheap and easy-to-implement RVF algorithm that retains key properties of PSRL. SU is highly effective on hard tabular exploration benchmarks. Furthermore, on the Atari 2600 domain, it surpasses human performance on 38 of 49 games tested (achieving a median human-normalised score of 2.09), and outperforms its closest RVF competitor, Bootstrapped DQN, on 36 of those.
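As a point of reference for the RVF idea the abstract builds on, the following is a minimal, hypothetical sketch of the generic randomised-value-function mechanism: sample one value function per episode from a Bayesian linear Q-model and act greedily under that sample. It is not the Successor Uncertainties algorithm itself, and every name in it is an illustrative assumption.

import numpy as np

# Generic randomised value function (RVF) agent with a Bayesian linear Q-model:
# draw one weight sample per episode, act greedily under the sampled Q.
# Illustrative only; not the Successor Uncertainties implementation.

class LinearRVFAgent:
    def __init__(self, n_features, prior_var=1.0, noise_var=0.01):
        self.precision = np.eye(n_features) / prior_var    # posterior precision
        self.b = np.zeros(n_features)                       # precision-weighted mean
        self.noise_var = noise_var
        self.w_sample = np.zeros(n_features)

    def start_episode(self):
        # One posterior sample of the Q-weights, held fixed for the episode.
        cov = np.linalg.inv(self.precision)
        mean = cov @ self.b
        self.w_sample = np.random.multivariate_normal(mean, cov)

    def act(self, phi_actions):
        # phi_actions: [n_actions, n_features] feature vector of each action.
        return int(np.argmax(phi_actions @ self.w_sample))

    def update(self, phi, q_target):
        # Bayesian linear regression update towards a bootstrapped Q target.
        self.precision += np.outer(phi, phi) / self.noise_var
        self.b += phi * q_target / self.noise_var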


Conservative Dual Policy Optimization for Efficient Model-Based Reinforcement Learning

Neural Information Processing Systems

Provably efficient Model-Based Reinforcement Learning (MBRL) based on optimism or posterior sampling (PSRL) is guaranteed to attain global optimality asymptotically by introducing the complexity measure of the model. However, the complexity might grow exponentially for the simplest nonlinear models, in which case global convergence is impossible within finite iterations. When the model suffers a large generalization error, which is quantitatively measured by the model complexity, the uncertainty can be large. The sampled model that the current policy is greedily optimized upon will thus be unsettled, resulting in aggressive policy updates and over-exploration. In this work, we propose Conservative Dual Policy Optimization (CDPO), which involves a Referential Update and a Conservative Update. The policy is first optimized under a reference model, which imitates the mechanism of PSRL while offering more stability. A conservative range of randomness is guaranteed by maximizing the expectation of model value. Without harmful sampling procedures, CDPO can still achieve the same regret as PSRL. More importantly, CDPO enjoys monotonic policy improvement and global optimality simultaneously.
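As a rough illustration of the two updates named in the abstract, the toy sketch below performs a referential update against a reference model (here simply the posterior mean) and then a conservative update that keeps whichever candidate policy has the highest value in expectation over sampled models. The Gaussian posterior, candidate set, and quadratic value surrogate are hypothetical stand-ins, not the paper's implementation.

import numpy as np

# Toy sketch of the dual update described in the abstract. The Gaussian model
# posterior, candidate-policy set, and quadratic value surrogate are all
# hypothetical stand-ins used only to show the structure of one iteration.

rng = np.random.default_rng(0)

def policy_value(policy, model):
    """Toy surrogate for the value of `policy` under `model` parameters."""
    return -np.sum((policy - model) ** 2)

def cdpo_iteration(posterior_mean, posterior_std, policy, step=0.5):
    # 1) Referential Update: improve the policy against a single reference
    #    model (the posterior mean), mimicking PSRL but with more stability.
    intermediate = policy + step * (posterior_mean - policy)

    # 2) Conservative Update: among candidate policies, keep the one with the
    #    highest value in EXPECTATION over the model posterior, rather than
    #    under one greedily sampled model, which limits aggressive updates.
    models = posterior_mean + posterior_std * rng.standard_normal((16, posterior_mean.size))
    candidates = [policy, 0.5 * (policy + intermediate), intermediate]
    scores = [np.mean([policy_value(c, m) for m in models]) for c in candidates]
    return candidates[int(np.argmax(scores))]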


A Proofs

Neural Information Processing Systems

We lay out the proof in two major steps. From the Performance Difference Lemma B.2, we obtain J(q…). Combining with (A.4) gives us the iterative improvement bound as follows: J(π…). From the Simulation Lemma B.1, we have the bound of … We can further know from the construction of the confidence set (cf. …). Similar to the proof in A.2, we obtain from the Simulation Lemma B.1 that E[…V…]. The claim is thus established. This lemma is widely adopted in RL; its proof can be found in various previous works, e.g. …
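Because the snippet truncates the equations, it may help to recall the standard forms of the two lemmas it invokes; the statements in the paper's Appendix B may differ in notation and constants.

% Performance Difference Lemma: value gap between two policies in a
% discounted MDP, expressed via the advantage under the comparator's occupancy.
\[
  J(\pi') - J(\pi) \;=\; \frac{1}{1-\gamma}\,
  \mathbb{E}_{s \sim d^{\pi'},\, a \sim \pi'(\cdot \mid s)}\big[ A^{\pi}(s,a) \big].
\]
% Simulation Lemma (equal, bounded rewards): value gap induced by using an
% estimated transition model \hat f in place of the true model f.
\[
  \big| V^{\pi}_{f}(s) - V^{\pi}_{\hat f}(s) \big| \;\le\;
  \frac{\gamma R_{\max}}{(1-\gamma)^{2}}\,
  \max_{s,a}\big\| f(\cdot \mid s,a) - \hat f(\cdot \mid s,a) \big\|_{1}.
\]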