Thompson Sampling for Noncompliant Bandits

Andrew Stirn, Tony Jebara

arXiv.org Machine Learning 

Multi-Armed Bandit (MAB) problems (Sutton and Barto, 1998) are a class of sequential decision-making problems in which an agent seeks to maximize rewards by acting in an unknown stationary environment. The MAB problem is often caricatured as a set of slot machines with unknown payout distributions: the agent must decide which arm to pull in order to maximize earnings. Because the machines' reward distributions are initially unknown, the agent must select actions that balance exploration (learning the reward distributions) with exploitation (playing the machine with the highest expected reward). Contextual bandits (CB) (Li et al., 2010a) are a slight modification of the MAB problem in which the reward distributions are conditioned on an observation that is revealed to the agent before it selects an action.
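The exploration/exploitation trade-off described above can be illustrated with a minimal sketch of standard Beta-Bernoulli Thompson sampling on a simulated, fully compliant bandit. This is the textbook algorithm, not the noncompliant variant the paper goes on to develop. The function name `thompson_sampling`, the binary rewards, the uniform Beta(1, 1) priors, and the payout probabilities used for simulation are all assumptions made for illustration.

```python
import random


def thompson_sampling(true_probs, n_rounds, seed=0):
    """Run Beta-Bernoulli Thompson sampling on a simulated bandit.

    true_probs: hypothetical payout probability of each arm. These are
                unknown to the agent and used only to simulate rewards.
    Returns (pulls per arm, total reward collected).
    """
    rng = random.Random(seed)
    n_arms = len(true_probs)
    successes = [0] * n_arms  # observed rewards of 1 per arm
    failures = [0] * n_arms   # observed rewards of 0 per arm
    pulls = [0] * n_arms
    total_reward = 0

    for _ in range(n_rounds):
        # Draw one plausible payout rate per arm from its Beta posterior.
        # Poorly explored arms have wide posteriors, so they still win
        # some draws (exploration); well-estimated good arms win most
        # draws (exploitation).
        samples = [
            rng.betavariate(successes[a] + 1, failures[a] + 1)
            for a in range(n_arms)
        ]
        arm = max(range(n_arms), key=lambda a: samples[a])

        # Simulate pulling the chosen arm.
        reward = 1 if rng.random() < true_probs[arm] else 0

        # Update the posterior counts for the chosen arm only.
        successes[arm] += reward
        failures[arm] += 1 - reward
        pulls[arm] += 1
        total_reward += reward

    return pulls, total_reward
```

Calling, for example, `thompson_sampling([0.2, 0.5, 0.8], 2000)` returns the number of pulls per arm and the total reward; over a run of that length most pulls should concentrate on the arm with the highest payout probability.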
