A Upper Bound with Gap-dependent Analysis

Neural Information Processing Systems 

We begin with the proof of thresholding technique. In this section, definitions are restated for completeness. A.1 Definitions We first restate the notations. An offline learning algorithm with output policy is pessimistic if with probability at least 1, the following arguments hold, 1. For a given successful pessimistic algorithm execution instance, where the arguments in Definition A.1 are simultaneously satisfied, we call M = (S, A,H,P,p At the same time, V, Q are the corresponding value functions and Q functions.