Breaking the Curse of Horizon: Infinite-Horizon Off-Policy Estimation
Qiang Liu, Lihong Li, Ziyang Tang, Dengyong Zhou
We consider the off-policy estimation problem of estimating the expected reward of a target policy using samples collected by a different behavior policy. Importance sampling (IS) has been a key technique for deriving (nearly) unbiased estimators, but it is known to suffer from excessively high variance in long-horizon problems. In the extreme case of infinite-horizon problems, the variance of an IS-based estimator may even be unbounded. In this paper, we propose a new off-policy estimation method that applies IS directly to the stationary state-visitation distributions, avoiding the exploding-variance issue faced by existing estimators. Our key contribution is a novel approach to estimating the density ratio of two stationary distributions using trajectories sampled only from the behavior policy. We develop a mini-max loss function for the estimation problem and derive a closed-form solution for the case where the discriminator class is a reproducing kernel Hilbert space (RKHS). We support our method with both theoretical and empirical analyses.
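The following is a minimal sketch (not the authors' released code) of the core idea described in the abstract: estimate the stationary-distribution density ratio w(s) = d_pi(s) / d_pi0(s) from behavior-policy transitions by minimizing the kernelized mini-max loss, then reweight per-step rewards by w(s) times the action-probability ratio pi(a|s)/pi0(a|s). It assumes a small discrete state space so w can be tabular; the function names (`rbf_kernel`, `kernel_loss`, `estimate_ratio`, `off_policy_value`) and the projected-gradient fitting procedure are illustrative choices, not the paper's implementation.

```python
import numpy as np

def rbf_kernel(x, y, bandwidth=1.0):
    """RBF kernel matrix between two batches of states (treated as real vectors)."""
    x = np.asarray(x, dtype=float).reshape(len(x), -1)
    y = np.asarray(y, dtype=float).reshape(len(y), -1)
    d2 = ((x[:, None, :] - y[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2.0 * bandwidth ** 2))

def kernel_loss(w, s, s_next, beta, K):
    """Closed-form (RKHS) value of the mini-max loss for a candidate ratio w.

    delta_i = w(s_i) * beta_i - w(s'_i), with beta_i = pi(a_i|s_i) / pi0(a_i|s_i);
    the loss is the quadratic form delta^T K(s', s') delta / n^2.
    """
    delta = w[s] * beta - w[s_next]
    return float(delta @ K @ delta) / len(s) ** 2

def estimate_ratio(s, s_next, beta, n_states, iters=2000, lr=0.5):
    """Fit a tabular, nonnegative w by projected gradient descent on the kernel loss."""
    K = rbf_kernel(s_next, s_next)
    w = np.ones(n_states)
    n = len(s)
    for _ in range(iters):
        delta = w[s] * beta - w[s_next]
        g = K @ delta / n                      # proportional to d(loss)/d(delta); constants folded into lr
        grad_w = np.zeros(n_states)
        np.add.at(grad_w, s, g * beta)         # d(delta_i)/d(w[s_i]) = beta_i
        np.add.at(grad_w, s_next, -g)          # d(delta_i)/d(w[s'_i]) = -1
        w = np.maximum(w - lr * grad_w, 1e-6)  # project onto w >= 0
        w *= n / w[s].sum()                    # keep the empirical mean of w(s_i) at 1
    return w

def off_policy_value(w, s, beta, r):
    """Self-normalized estimate of the target policy's average per-step reward."""
    weights = w[s] * beta
    return float(np.sum(weights * r) / np.sum(weights))
```

In larger or continuous state spaces the tabular `w` would be replaced by a parametric function (for example, a neural network) trained on the same loss; the normalization step reflects the constraint that w must average to one under the behavior policy's stationary distribution.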
Neural Information Processing Systems
Dec-31-2018