Taming the Monster: A Fast and Simple Algorithm for Contextual Bandits

Agarwal, Alekh, Hsu, Daniel, Kale, Satyen, Langford, John, Li, Lihong, Schapire, Robert E.

arXiv.org Machine Learning 

In the contextual bandit problem, an agent collects rewards for actions taken over a sequence of rounds; in each round, the agent chooses an action to take on the basis of (i) context (or features) for the current round, as well as (ii) feedback, in the form of rewards, obtained in previous rounds. The feedback is incomplete: in any given round, the agent observes the reward only for the chosen action; the agent does not observe the reward for other actions. Contextual bandit problems are found in many important applications such as online recommendation and clinical trials, and represent a natural halfway point between supervised learning and reinforcement learning. The use of features to encode context is inherited from supervised machine learning, while exploration is necessary for good performance as in reinforcement learning. The choice of exploration distribution on actions is important. The strongest known results(Auer et al., 2002; McMahan and Streeter, 2009; Beygelzimer et al., 2011) provide algorithms that carefully control the exploration distribution to achieve an optimal regret after T rounds of () O KT log( Π /δ), with probability at least 1 δ, relative to a set of policies Π A

Duplicate Docs Excel Report

Title
None found

Similar Docs  Excel Report  more

TitleSimilaritySource
None found