Unbiased Estimation of the Value of an Optimized Policy

Portugaly, Elon, Pfeiffer, Joseph J. III

Jun-7-2018–arXiv.org Machine Learning

Randomized trials, also known as A/B tests, are used to select between two policies: a control and a treatment. Given a corresponding set of features, we can ideally learn an optimized policy P that maps the A/B test data features to action space and optimizes reward. However, although A/B testing provides an unbiased estimator for the value of deploying B (i.e., switching from policy A to B), direct application of those samples to learn the the optimized policy P generally does not provide an unbiased estimator of the value of P as the samples were observed when constructing P. In situations where the cost and risks associated of deploying a policy are high, such an unbiased estimator is highly desirable. We present a procedure for learning optimized policies and getting unbiased estimates for the value of deploying them. We wrap any policy learning procedure with a bagging process and obtain out-of-bag policy inclusion decisions for each sample. We then prove that inverse-propensity-weighting effect estimator is unbiased when applied to the optimized subset. Likewise, we apply the same idea to obtain out-of-bag unbiased per-sample value estimate of the measurement that is independent of the randomized treatment, and use these estimates to build an unbiased doubly-robust effect estimator. Lastly, we empirically shown that even when the average treatment effect is negative we can find a positive optimized policy.

artificial intelligence, estimator, machine learning, (17 more...)

arXiv.org Machine Learning

Jun-7-2018

arXiv.org PDF

Add feedback

Genre:
- Research Report > Experimental Study (1.00)

Technology:
- Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Regression (0.46)

Duplicate Docs Excel Report

Title
None found

Similar Docs Excel Report more

Title	Similarity	Source
None found