A Proof of Lemma 1

Neural Information Processing Systems 

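For reference, Property 1 can be stated as follows. This is a reconstruction of the standard identity from Liu et al. [2018] under assumed notation (the paper's own symbols may differ): $\pi_e$ is the evaluation policy, $\pi_b$ the behavior policy, $\rho_{t'} = \pi_e(A_{t'} \mid S_{t'}) / \pi_b(A_{t'} \mid S_{t'})$ the per-step likelihood ratio, and $d_t^{\pi}(s,a)$ the state-action marginal at step $t$ under $\pi$:

```latex
\mathbb{E}_{\pi_b}\!\left[\,\prod_{t'=0}^{t} \rho_{t'} \;\middle|\; S_t = s,\, A_t = a \right]
  \;=\; \frac{d_t^{\pi_e}(s,a)}{d_t^{\pi_b}(s,a)},
\qquad
d_t^{\pi}(s,a) \;=\; \Pr(S_t = s, A_t = a \mid \pi).
```

The time-independent ratio referred to in the proof then corresponds to averaging these marginals over the horizon, i.e. $d^{\pi}(s,a) = \frac{1}{L}\sum_{t=0}^{L-1} d_t^{\pi}(s,a)$, again under the assumed notation.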
Liu et al. [2018] first showed that stationary importance sampling methods can be viewed as a Rao-Blackwellization of the IS estimator, and claimed that the expectation of the likelihood ratios conditioned on the state and action is equal to the distribution ratio, as stated in Property 1. For completeness, we present a proof of Property 1. Recall the definition of the marginal state-action distribution d: this additional marginalization step over time allows us to consider time-independent distribution ratios. Then, using the law of total expectation, we can write the expectation of the second sum in (4) as in (5). Under Assumption 1, plugging the final expression from (5) back into (4) gives (6). Note that in the infinite-horizon setting, where L → ∞ and n is finite, (6) reduces to its limiting form. Similarly, generalizing this pattern, it can be observed that unrolling n times yields the corresponding n-step expression.

B Experimental Details

For all experiments, we use the domains and algorithm implementations from the Caltech OPE Benchmarking Suite (COBS) by Voloshin et al. [2019]. We include a brief description of each of these domains below; a full description of each can be found in the work by Voloshin et al. [2019].

Graph Environment. The Graph environment is a two-chain environment with 2L states and 2 actions.
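As an illustrative, self-contained sanity check (not part of COBS, and with made-up MDP parameters), the identity behind Property 1 — that the conditional expectation of the cumulative likelihood ratios equals the marginal distribution ratio — can be verified by Monte Carlo on a tiny two-state MDP. The policies `pi_b`, `pi_e`, the transition kernel `P`, and the horizon are all hypothetical choices:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy MDP: all numbers below are illustrative, not from the paper or COBS.
n_states, n_actions, T = 2, 2, 2        # we condition on the pair (S_T, A_T)
pi_b = np.array([[0.5, 0.5],            # behavior policy   pi_b[s, a]
                 [0.5, 0.5]])
pi_e = np.array([[0.8, 0.2],            # evaluation policy pi_e[s, a]
                 [0.3, 0.7]])
P = np.array([[[0.9, 0.1], [0.2, 0.8]], # transition kernel P[s, a, s']
              [[0.6, 0.4], [0.5, 0.5]]])

def marginal(pi):
    """Exact state-action marginal d_T(s, a) at step T under policy pi."""
    p = np.array([1.0, 0.0])            # deterministic start in state 0
    for _ in range(T):
        # propagate the state distribution one step under pi
        p = np.einsum('s,sa,sax->x', p, pi, P)
    return p[:, None] * pi              # d_T(s, a) = Pr(S_T = s) * pi(a | s)

# Monte Carlo estimate of E[ prod_{t=0}^{T} rho_t | S_T = s, A_T = a ]
# from rollouts generated under the behavior policy pi_b.
n_episodes = 100_000
sums = np.zeros((n_states, n_actions))
counts = np.zeros((n_states, n_actions))
for _ in range(n_episodes):
    s, rho = 0, 1.0
    for t in range(T + 1):
        a = rng.choice(n_actions, p=pi_b[s])
        rho *= pi_e[s, a] / pi_b[s, a]  # cumulative likelihood ratio
        if t < T:
            s = rng.choice(n_states, p=P[s, a])
    sums[s, a] += rho                   # (s, a) is now (S_T, A_T)
    counts[s, a] += 1

empirical = sums / counts               # conditional mean of cumulative ratios
exact = marginal(pi_e) / marginal(pi_b) # distribution ratio d^{pi_e}/d^{pi_b}
print(np.max(np.abs(empirical - exact)))  # deviation shrinks as n_episodes grows
```

With enough episodes the empirical conditional means converge to the exact marginal ratios at the usual Monte Carlo rate, which is precisely the content of the Rao-Blackwellization view: averaging the cumulative ratios within each (s, a) cell recovers the stationary ratio.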