Doubly Robust Off-Policy Value and Gradient Estimation for Deterministic Policies

Neural Information Processing Systems 

In this paper, we propose several doubly robust estimators of the off-policy value and gradient for deterministic policies in reinforcement learning (RL).