A Proof of Theorem 1 and 2

Neural Information Processing Systems 

Before proving the theorems, we first introduce the confidence set for both transition probability and reward functions. We follow the standard regret analysis framework by Jaksch et al. (2010). Using Hoeffding's inequality, the regret caused by (6) can be upper bounded by For PSRL, since we use fixed episodes, we follow the techniques from Osband et al. (2013) For DORL, we need to prove optimism, i.e, We follow the proof in Agrawal and Jia (2017). We fix some s S and let x = ( s,π (s)) X .