Regularized Contextual Bandits

Xavier Fontaine, Quentin Berthet, Vianney Perchet

arXiv.org Machine Learning 

In sequential optimization problems, an agent takes successive decisions in order to minimize an unknown loss function. An important class of such problems, nowadays known as bandit problems, was mathematically formalized by Robbins in his seminal paper (Robbins, 1952). In the so-called stochastic multi-armed bandit problem, an agent chooses to sample (or "pull") among K arms returning random rewards. Only the rewards of the selected arms are revealed to the agent, who does not get any additional feedback. Bandit problems naturally model the exploration/exploitation tradeoff that arises in sequential decision making under uncertainty. Various general algorithms have been proposed to solve this problem, following the work of Lai and Robbins (1985), who obtained logarithmic regret for their sample-mean-based policy. Further bounds have been obtained by Agrawal (1995) and Auer et al. (2002), who developed different versions of the well-known UCB algorithm. The setting of classical stochastic multi-armed bandits is unfortunately too restrictive for real-world applications: the choice of the agent can and should be influenced by additional information (referred to as "context" or "covariate") revealed by the environment.
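To make the classical setting concrete, the following Python sketch runs a standard UCB1-style index policy (in the spirit of Auer et al., 2002) on a K-armed stochastic bandit with Bernoulli rewards. The function and arm names, the Bernoulli reward model, and the horizon are illustrative assumptions, not the algorithm studied in this paper.

import numpy as np

def ucb1(reward_fns, horizon):
    # Standard UCB1 index policy on a K-armed stochastic bandit.
    # reward_fns: list of K callables, each returning a random reward in [0, 1];
    # only the pulled arm's reward is observed (bandit feedback).
    K = len(reward_fns)
    counts = np.zeros(K)   # number of pulls per arm
    means = np.zeros(K)    # empirical mean reward per arm
    total_reward = 0.0
    for t in range(1, horizon + 1):
        if t <= K:
            arm = t - 1    # pull each arm once to initialize the estimates
        else:
            # UCB index: empirical mean plus an exploration bonus sqrt(2 log t / n_i)
            bonus = np.sqrt(2.0 * np.log(t) / counts)
            arm = int(np.argmax(means + bonus))
        reward = reward_fns[arm]()
        counts[arm] += 1
        means[arm] += (reward - means[arm]) / counts[arm]  # incremental mean update
        total_reward += reward
    return total_reward, means

# Example: three Bernoulli arms with unknown means 0.3, 0.5 and 0.7 (assumed values).
rng = np.random.default_rng(0)
arms = [lambda p=p: float(rng.random() < p) for p in (0.3, 0.5, 0.7)]
total, estimates = ucb1(arms, horizon=10_000)
print(total, estimates)

In this sketch the agent observes no context: the exploration bonus depends only on pull counts, which is precisely the restriction that contextual bandit formulations, such as the one studied here, aim to lift.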
