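\[
\cdots \Big[ \big\| \tilde{f}_t(s,a) - f(\cdot \mid s,a) \big\|_1 \Big] + \mathbb{E}_{\rho_{\pi_{t-1}}} \Big[ \big\| \tilde{f}_t(s,a) - f(\cdot \mid s,a) \big\|_1 \Big], \tag{A.1}
\]
where $(t) := \mathbb{E}_{s \sim \zeta} V$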

Neural Information Processing Systems 

From the Posterior Sampling Lemma, we know that if $\psi$ is the distribution of $f$, then for any function $g$ that is measurable with respect to the sigma-algebra $\sigma(H_t)$, $\mathbb{E}[g(f) \mid H_t] = \mathbb{E}[g(f_t) \mid H_t]$. We further know from the construction of the confidence set (c.f.

This lemma is widely adopted in RL. Proofs can be found in various previous works, e.g.

Prior work that shares similarities with ours includes DPI [59] and GPS [31, 39], which also adopt dual policy optimization procedures.
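As an illustration of the Posterior Sampling Lemma stated above, the following is a minimal numerical sketch on a toy Beta-Bernoulli model, not the dynamics-model setting of this paper. It assumes a uniform Beta(1, 1) prior in the role of $\psi$, a history made of a few Bernoulli observations, and the test function $g(x) = x^2$; the function name `posterior_sampling_check` and all parameters are hypothetical. Conditioned on the history, the true parameter and a posterior sample should give matching averages of $g$.

```python
import random
from collections import defaultdict


def posterior_sampling_check(n_obs=3, n_trials=100_000, seed=0):
    """Monte Carlo check of E[g(f) | H] = E[g(f_t) | H] on a toy model.

    Assumptions (illustrative only): f is a Bernoulli parameter drawn
    from a Beta(1, 1) prior, H is a set of n_obs Bernoulli draws, and
    f_t is one sample from the exact Beta posterior given H.
    """
    rng = random.Random(seed)

    def g(x):
        # Arbitrary test function; any bounded g would serve.
        return x * x

    # For each history (summarised by its success count k), accumulate
    # [sum of g(true parameter), sum of g(posterior sample), count].
    sums = defaultdict(lambda: [0.0, 0.0, 0])
    for _ in range(n_trials):
        theta = rng.betavariate(1.0, 1.0)              # true parameter
        k = sum(rng.random() < theta for _ in range(n_obs))  # history
        theta_t = rng.betavariate(1.0 + k, 1.0 + n_obs - k)  # posterior sample
        acc = sums[k]
        acc[0] += g(theta)
        acc[1] += g(theta_t)
        acc[2] += 1

    # Conditional averages of g for the true parameter and the sample.
    return {k: (a / c, b / c) for k, (a, b, c) in sorted(sums.items())}


if __name__ == "__main__":
    for k, (true_avg, sample_avg) in posterior_sampling_check().items():
        print(f"k={k}: E[g(f)|H]~{true_avg:.4f}  E[g(f_t)|H]~{sample_avg:.4f}")
```

For each history, the two averages should agree up to Monte Carlo error, which is the property that lets the regret analysis swap the unknown true model for the sampled one inside conditional expectations.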