A Proof of Theorem 1 and 2

Aug-16-2025, 14:51:50 GMT–Neural Information Processing Systems

Before proving the theorems, we first introduce the confidence set for both transition probability and reward functions. We follow the standard regret analysis framework by Jaksch et al. (2010). Using Hoeffding's inequality, the regret caused by (6) can be upper bounded by For PSRL, since we use fixed episodes, we follow the techniques from Osband et al. (2013) For DORL, we need to prove optimism, i.e, We follow the proof in Agrawal and Jia (2017). We fix some s S and let x = ( s,π (s)) X .

artificial intelligence, mdp, probability, (17 more...)

Neural Information Processing Systems

Aug-16-2025, 14:51:50 GMT

Conferences PDF

Add feedback

Technology:
- Information Technology > Artificial Intelligence > Representation & Reasoning (0.34)

Duplicate Docs Excel Report

Title
d3b1fb02964aa64e257f9f26a31f72cf-Supplemental.pdf

Similar Docs Excel Report more

Title	Similarity	Source
None found