A Off-policy evaluation dual objective

We formulate the estimation of the stationary state distribution µ

Neural Information Processing Systems 

Our error analysis relies on techniques similar to those of the finite-sample analysis in Abbasi-Yadkori et al. For simplicity, we focus on finite-state Markov chains instead of MDPs.

Lemma D.2. [Hazewinkel, 2001] Let x U and x

This lemma gives us the following direct corollary. Suppose that x and y are two independent samples from U .

We then apply the Davis-Kahan theorem [Davis and Kahan, 1970] (see also Theorem 2 in Yu et al.