Thompson Sampling for Noncompliant Bandits
Multi-armed bandit (MAB) problems (Sutton and Barto, 1998) are a class of sequential decision-making problems in which an agent seeks to maximize reward by acting in an unknown stationary environment. The MAB problem is often illustrated with a set of slot machines that have unknown payout distributions: the agent must decide which arm to pull in order to maximize earnings. Because the machines' reward distributions are initially unknown, the agent must select actions that balance exploration (learning the reward distributions) with exploitation (playing the machine with the highest expected reward). Contextual bandits (CB) (Li et al., 2010a) are a variant of the MAB problem in which the reward distributions are conditioned on an observation that is revealed to the agent before it selects an action.
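The exploration/exploitation trade-off described above can be sketched with standard Beta-Bernoulli Thompson sampling. This is only an illustrative baseline under stated assumptions: rewards are binary, every chosen arm is actually played (full compliance), and each arm starts with a uniform Beta(1, 1) prior. The names `thompson_sampling`, `true_probs`, and `n_rounds` are hypothetical and do not come from the paper, and the sketch is not the noncompliant variant the title refers to.

```python
import random


def thompson_sampling(true_probs, n_rounds, seed=0):
    """Run Beta-Bernoulli Thompson sampling on a simulated bandit.

    true_probs: hypothetical success probability of each arm (unknown to the agent).
    Returns (pulls per arm, total reward collected).
    """
    rng = random.Random(seed)
    n_arms = len(true_probs)
    successes = [0] * n_arms  # observed rewards of 1 per arm
    failures = [0] * n_arms   # observed rewards of 0 per arm
    total_reward = 0

    for _ in range(n_rounds):
        # Draw one plausible mean reward per arm from its Beta posterior.
        # Uncertain arms produce widely spread draws, which drives exploration.
        samples = [
            rng.betavariate(successes[a] + 1, failures[a] + 1)
            for a in range(n_arms)
        ]
        # Act greedily with respect to the sampled means (exploitation).
        arm = max(range(n_arms), key=lambda a: samples[a])

        # Simulate the environment: Bernoulli reward from the chosen arm.
        reward = 1 if rng.random() < true_probs[arm] else 0

        # Update the posterior counts of the arm that was played.
        if reward:
            successes[arm] += 1
        else:
            failures[arm] += 1
        total_reward += reward

    pulls = [successes[a] + failures[a] for a in range(n_arms)]
    return pulls, total_reward


if __name__ == "__main__":
    pulls, total = thompson_sampling([0.1, 0.9], n_rounds=2000)
    print("pulls per arm:", pulls)
    print("total reward:", total)
```

Sampling from the posterior, rather than always choosing the arm with the highest estimated mean, is what lets the agent keep trying arms it is still uncertain about while concentrating its pulls on the arm that looks best.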
Dec-3-2018