Optimal Baseline Corrections for Off-Policy Contextual Bandits
Shashank Gupta, Olivier Jeunen, Harrie Oosterhuis, Maarten de Rijke
– arXiv.org Artificial Intelligence
The off-policy learning paradigm allows for recommender systems and general ranking applications to be framed as decision-making problems, where we aim to learn decision policies that optimize an unbiased offline estimate of an online reward metric. With unbiasedness comes potentially high variance, and prevalent methods exist to reduce estimation variance. These methods typically make use of control variates, either additive (i.e., baseline corrections or doubly robust methods) or multiplicative (i.e., self-normalisation). Our work unifies these approaches by proposing a single framework built on their equivalence in learning scenarios. The foundation of our framework is the derivation of an equivalent baseline correction […]

Additive control variates give rise to baseline corrections [16], regression adjustments [15], and doubly robust estimators [13]. Multiplicative control variates lead to self-normalised estimators [32, 59]. Previous work has proven that, for off-policy learning tasks, the multiplicative control variates can be re-framed using an equivalent additive variate [6, 30], enabling mini-batch optimization methods to be used. We note that the self-normalised estimator is only asymptotically unbiased: a clear disadvantage for evaluation with finite samples. The common problem which most existing methods tackle is that of variance reduction in offline value estimation, either for learning or for evaluation. The common solution is the application of a control variate, either multiplicative or additive [42].
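To make the two families of control variates concrete, here is a minimal sketch (our own illustration, not the paper's code or notation) of a plain inverse propensity scoring (IPS) estimator alongside an additively corrected and a self-normalised variant; the scalar `baseline` and the toy data are assumptions for illustration only:

```python
import numpy as np

# Illustrative sketch, not the paper's implementation.

def ips(rewards, target_probs, logging_probs):
    # Vanilla IPS: unbiased under full support, but the importance
    # weights can make its variance very large.
    weights = target_probs / logging_probs
    return np.mean(weights * rewards)

def baseline_corrected_ips(rewards, target_probs, logging_probs, baseline):
    # Additive control variate: subtract a fixed scalar baseline from the
    # reward and add it back. Since the importance weights have unit
    # expectation, the estimate stays unbiased while a well-chosen
    # baseline reduces variance.
    weights = target_probs / logging_probs
    return np.mean(weights * (rewards - baseline)) + baseline

def self_normalised_ips(rewards, target_probs, logging_probs):
    # Multiplicative control variate: normalise by the sum of the weights.
    # Only asymptotically unbiased, which matters in finite-sample evaluation.
    weights = target_probs / logging_probs
    return np.sum(weights * rewards) / np.sum(weights)

# Toy logged data: rewards observed under a logging policy, plus the
# propensities of the logging and target policies for the logged actions.
rng = np.random.default_rng(0)
rewards = rng.binomial(1, 0.3, size=10_000).astype(float)
logging_probs = rng.uniform(0.2, 0.8, size=10_000)
target_probs = rng.uniform(0.2, 0.8, size=10_000)

print(ips(rewards, target_probs, logging_probs))
print(baseline_corrected_ips(rewards, target_probs, logging_probs, baseline=0.3))
print(self_normalised_ips(rewards, target_probs, logging_probs))
```

The additive form remains unbiased for any fixed baseline because, under standard full-support assumptions, the importance weights have expectation one; the self-normalised form instead trades exact unbiasedness for variance reduction, which is the finite-sample disadvantage noted above.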
May 9, 2024