
Neural Information Processing Systems 

A.1 Notation

In this appendix, we use the notation $d^{\pi}_t(\cdot,\cdot)$ to indicate the state-action visitation measure induced by the policy $\pi$ at time $t$. We overload the notation $d^{\pi}_t(\cdot)$ to denote the state-visitation measure induced by the policy $\pi$ at time $t$. Likewise, the notations $d^{D}_t(\cdot,\cdot)$ and $d^{D}_t(\cdot)$ indicate the empirical visitation measures in the dataset $D$. For a function $g : \mathcal{X} \to \mathbb{R}$, the norm $\|g\|_\infty \triangleq \sup_{x \in \mathcal{X}} |g(x)|$. Before discussing the proofs of the results, we also explain the instantiation of the function class in the tabular setting below.

A.2 Imitation gap upper bound on empirical moment matching (Theorem 3.1)

Below we restate Theorem 3.1 and provide a proof of this result. The key observation is that the learner $\pi^{\mathrm{MM}}$ best matches the empirical distribution in the dataset, which is in turn close to the population visitation measure induced by $\pi^E$, so we can expect the visitation measures induced by $\pi^E$ and $\pi^{\mathrm{MM}}$ to be close. This in turn implies that both policies collect a similar value under any reward function. Precisely characterizing the rates at which these distributions converge to one another results in the final bound.

Consider the empirical moment matching learner $\pi^{\mathrm{MM}}$ (eq. [...]

$$\cdots \; \mathrm{TV}\left(d^{\pi}_t, d^{D}_t\right) \tag{20}$$

where the equation follows by the variational definition of the total variation distance, and where $d^{\pi}_t$ is the state-action visitation measure induced by $\pi^E$ and $d^{D}_t$ is the empirical state-action visitation measure in the dataset $D$. The imitation gap of this policy can be upper bounded by

$$J(\pi^E) - J(\pi^{\mathrm{MM}}) = \mathbb{E}_{\pi^E}\Big[\, \cdots$$

This goes to show that in the tabular setting, MM is equivalent to finding the policy which best matches (in TV distance) the empirical state-action distribution observed in the dataset.
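The tabular reduction above rests on the variational form of the total variation distance. The following is a minimal sketch that checks it numerically on a toy example. The trajectories, the candidate measure `d_pi_0`, and all function names are hypothetical and not taken from the paper. The sketch assumes the convention $\mathrm{TV}(p,q) = \tfrac{1}{2}\sum_x |p(x) - q(x)| = \tfrac{1}{2}\sup_{\|f\|_\infty \le 1} \left(\mathbb{E}_p[f] - \mathbb{E}_q[f]\right)$; the paper's normalization may differ by a constant factor.

```python
from collections import Counter
from itertools import product


def empirical_visitation(trajectories, t):
    """Empirical state-action visitation measure at time t.

    Each trajectory is a list of (state, action) pairs; the measure is the
    fraction of trajectories that visit each pair at step t.
    """
    counts = Counter(traj[t] for traj in trajectories)
    n = len(trajectories)
    return {sa: c / n for sa, c in counts.items()}


def tv_distance(p, q):
    """Total variation distance as half the L1 distance (assumed convention)."""
    support = set(p) | set(q)
    return 0.5 * sum(abs(p.get(x, 0.0) - q.get(x, 0.0)) for x in support)


def tv_variational(p, q):
    """Half the supremum of E_p[f] - E_q[f] over functions with sup-norm <= 1.

    The objective is linear in f, so the supremum is attained at a vertex of
    the cube [-1, 1]^support; brute force over sign vectors is exact but only
    feasible for tiny supports.
    """
    support = sorted(set(p) | set(q))
    best = 0.0
    for signs in product((-1.0, 1.0), repeat=len(support)):
        value = sum(s * (p.get(x, 0.0) - q.get(x, 0.0))
                    for s, x in zip(signs, support))
        best = max(best, value)
    return 0.5 * best


# Hypothetical dataset D: four expert trajectories of horizon H = 2.
dataset = [
    [("s0", "a0"), ("s1", "a1")],
    [("s0", "a0"), ("s1", "a0")],
    [("s0", "a1"), ("s2", "a0")],
    [("s0", "a0"), ("s1", "a1")],
]

# Empirical measure at t = 0 and a hypothetical candidate policy's measure.
d_D_0 = empirical_visitation(dataset, 0)
d_pi_0 = {("s0", "a0"): 0.5, ("s0", "a1"): 0.5}

tv_direct = tv_distance(d_pi_0, d_D_0)
tv_sup = tv_variational(d_pi_0, d_D_0)
print(tv_direct, tv_sup)
```

Both computations agree on this example, which illustrates why matching moments over the full class of bounded functions in the tabular setting amounts to minimizing TV distance to the empirical measure. The search over candidate policies is omitted here.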
