A Proofs

Neural Information Processing Systems 

We can therefore drop the latter term from our bound. Consider the Cliff problem of Swamy et al. [2021]. Note that under Asymptotic Realizability (Assumption 5.1), there exists a policy [...] We specialize to the two-arm case, as it is the most difficult for the learner. When this limit exists, the moment-matching error averaged over timesteps is equal to it. We give the off-policy learners 25 demonstration trajectories, each of length 1000.
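The statement about the limit and the average over timesteps appears to be an instance of the standard Cesàro-mean fact. As a sketch only, write $\epsilon_t$ for the moment-matching error at timestep $t$ and $\epsilon_\infty$ for its limit; both symbols are assumed here for illustration and are not taken from the original notation:

```latex
\lim_{t \to \infty} \epsilon_t = \epsilon_\infty
\quad \Longrightarrow \quad
\lim_{T \to \infty} \frac{1}{T} \sum_{t=1}^{T} \epsilon_t = \epsilon_\infty .
```

The converse does not hold in general: a time average can converge even when the per-timestep sequence does not, which is why the existence of the limit is stated as a condition.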
