High-Confidence Off-Policy Evaluation
Thomas, Philip S. (University of Massachusetts, Amherst) | Theocharous, Georgios (Adobe Research) | Ghavamzadeh, Mohammad (Adobe Research)
Many reinforcement learning algorithms use trajectories collected from the execution of one or more policies to propose a new policy. Because executing a bad policy can be costly or dangerous, techniques for evaluating the performance of the new policy without executing it have recently attracted interest in industry. Such off-policy evaluation methods, which estimate the performance of a policy using trajectories collected from the execution of other policies, have so far not provided confidence guarantees on the accuracy of their estimates. In this paper we propose an off-policy method for computing a lower confidence bound on the expected return of a policy.
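To make the idea in the abstract concrete, the following is a minimal sketch of one generic way to obtain a lower confidence bound from off-policy data: weight each trajectory's return by an importance ratio, then apply Hoeffding's inequality. This is not necessarily the bound proposed in the paper. The function names, the `(state, action, reward)` trajectory format, the policy signature `pi(state, action)`, and the requirement that a valid upper bound `upper` on the weighted returns is known (with returns nonnegative) are all assumptions made for illustration.

```python
import math


def importance_weighted_return(trajectory, pi_e, pi_b, gamma=1.0):
    """Importance-weighted discounted return of one trajectory.

    trajectory: list of (state, action, reward) tuples (assumed format).
    pi_e, pi_b: hypothetical callables returning the probability that the
                evaluation / behavior policy takes `action` in `state`.
    """
    weight = 1.0
    ret = 0.0
    for t, (state, action, reward) in enumerate(trajectory):
        # Ratio of action probabilities under the two policies.
        weight *= pi_e(state, action) / pi_b(state, action)
        ret += (gamma ** t) * reward
    return weight * ret


def hoeffding_lower_bound(values, upper, delta):
    """Lower bound on the mean that holds with probability >= 1 - delta.

    Assumes the values are i.i.d. and each lies in [0, upper].
    """
    n = len(values)
    mean = sum(values) / n
    return mean - upper * math.sqrt(math.log(1.0 / delta) / (2.0 * n))
```

Because importance weights can be very large, `upper` is often large too, which makes a Hoeffding-style bound loose unless many trajectories are available. That looseness is one reason off-policy confidence bounds are a research topic in their own right.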
Mar-6-2015
- Country:
- North America > United States > Massachusetts (0.28)
- Genre:
- Research Report (0.46)
- Industry:
- Health & Medicine (0.93)
- Technology: