A Omitted Details from Main Body

Neural Information Processing Systems 

Thus, the multiplicity of the optimal policies does not break the assumption.

A.2 Omitted Algorithms

Algorithm 4 Model-Free Sampling Routine
Require:

In this section, our main goal is to prove Theorem 3.1. The proofs of the supporting lemmas are postponed to Appendix B.1. The regret decomposition in [HZG21] gives us that

Lemma B.1. The following lemma resembles Lemma 6.3 of [HZG21].