Appendix Algorithm 2 Vanilla Pessimistic Value Iteration
–Neural Information Processing Systems
Absolute Constant C, failure probability δ. 2: Initialization: Set bVH+1() 0. 3: for time h= H,H 1,...,1 do 4: Set bQh(,) brh(,) + ( bPh bVh+1)(,) 5: sh,ah, set Γh(sh,ah) = The upper bound of OPE comes from Yin and Wang [2020], Duan et al. [2020] and the Cramer-Rao lower bound comes from Jiang and Li [2016]. Table A shows the statistical optimality for offline policy learning and offline policy evaluation (OPE) in the non-stationary tabular MDPs. By Cauchy-Schwartz inequality, it can be checked that the rate between the two bounds (roughly) deviate by a factor of H (in terms of sample complexity), and this reveals that offline learning is inherently harder than OPE from the statistical aspect. Our analysis of the intrinsic learning bound in Section 4 leverage the key design feature of APVI that bVh+1 only depends on the transition data from time h+ 1 to H while bPh only uses transition pairs at time h. This enables concentration inequalities due the conditional independence.7 Especially, to recover the q VarP(V?h+1) structure to we Next, we use (3) as the intermediate step to crude bounding ||bVh+1 V?h+1|| . Lastly, we can combine this with the extended value difference lemma in Cai et al. [2020] to bound V?1 bV1 and leverage the pessimistic design for bounding bV1 Vbπ1 .
Neural Information Processing Systems
Apr-25-2026, 01:53:39 GMT
- Technology: