Optimal Baseline Corrections for Off-Policy Contextual Bandits

Shashank Gupta, Olivier Jeunen, Harrie Oosterhuis, Maarten de Rijke

arXiv.org Artificial Intelligence 

The off-policy learning paradigm allows for recommender systems and general ranking applications to be framed as decision-making problems, where we aim to learn decision policies that optimize an unbiased offline estimate of an online reward metric. With unbiasedness comes potentially high variance, and prevalent methods exist to reduce estimation variance. These methods typically make use of control variates, either additive (i.e., baseline corrections or doubly robust methods) or multiplicative (i.e., self-normalisation). Our work unifies these approaches by proposing a single framework built on their equivalence in learning scenarios. The foundation of our framework is the derivation of an equivalent baseline correction for all existing control variates.

The common problem that most existing methods tackle is that of variance reduction in offline value estimation, either for learning or for evaluation. The common solution is the application of a control variate, either multiplicative or additive [42]. Additive control variates give rise to baseline corrections [16], regression adjustments [15], and doubly robust estimators [13]. Multiplicative control variates lead to self-normalised estimators [32, 59]. Previous work has proven that for off-policy learning tasks, the multiplicative control variates can be re-framed using an equivalent additive variate [6, 30], enabling mini-batch optimization methods to be used. We note that the self-normalised estimator is only asymptotically unbiased: a clear disadvantage for evaluation with finite samples.
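To make the two types of control variates concrete, here is a brief sketch in generic off-policy notation (standard in this literature; the symbols below are not necessarily the paper's exact formulation). Given logged interactions $(x_i, a_i, r_i)_{i=1}^{N}$ collected under a logging policy $\pi_0$, and importance weights $w_i = \pi(a_i \mid x_i) / \pi_0(a_i \mid x_i)$ for a target policy $\pi$, the vanilla inverse propensity scoring (IPS) estimator, its baseline-corrected (additive) variant, and the self-normalised (multiplicative) variant are:

$$
\hat{V}_{\mathrm{IPS}}(\pi) = \frac{1}{N} \sum_{i=1}^{N} w_i\, r_i,
\qquad
\hat{V}_{\beta}(\pi) = \frac{1}{N} \sum_{i=1}^{N} w_i\, (r_i - \beta) + \beta,
\qquad
\hat{V}_{\mathrm{SNIPS}}(\pi) = \frac{\sum_{i=1}^{N} w_i\, r_i}{\sum_{i=1}^{N} w_i}.
$$

Since $\mathbb{E}_{\pi_0}[w_i] = 1$, the baseline-corrected estimator stays unbiased for any fixed $\beta$, while a well-chosen $\beta$ shrinks its variance; the self-normalised estimator, by contrast, is only asymptotically unbiased, as noted above.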

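As a complementary numerical illustration (a minimal synthetic sketch, not the paper's experimental setup; the toy policies, reward model, and baseline choice below are assumptions made purely for demonstration), the three estimators can be compared on simulated logged-bandit feedback:

# Minimal synthetic sketch (illustrative only): compare IPS, baseline-corrected IPS,
# and self-normalised IPS (SNIPS) on simulated logged-bandit data.
import numpy as np

rng = np.random.default_rng(0)
n_samples, n_actions = 10_000, 5

# Logging policy pi_0 is uniform; the target policy pi favours action 0.
pi_0 = np.full(n_actions, 1.0 / n_actions)
pi = np.array([0.6, 0.1, 0.1, 0.1, 0.1])

# Simulated environment: Bernoulli reward probability per action (assumed for illustration).
reward_prob = np.array([0.5, 0.1, 0.1, 0.1, 0.1])
actions = rng.choice(n_actions, size=n_samples, p=pi_0)
rewards = rng.binomial(1, reward_prob[actions])

w = pi[actions] / pi_0[actions]                   # importance weights

ips = np.mean(w * rewards)                        # unbiased, potentially high variance
beta = 0.3                                        # fixed baseline; any constant keeps the estimator unbiased
ips_beta = np.mean(w * (rewards - beta)) + beta   # additive control variate
snips = np.sum(w * rewards) / np.sum(w)           # multiplicative control variate

print(f"true value of pi : {pi @ reward_prob:.4f}")
print(f"IPS              : {ips:.4f}")
print(f"IPS with baseline: {ips_beta:.4f}")
print(f"SNIPS            : {snips:.4f}")

Re-running this with different seeds gives a rough feel for how the corrected estimators typically fluctuate less around the true value than vanilla IPS.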