Thompson Sampling for Multinomial Logit Contextual Bandits

Min-hwan Oh, Garud Iyengar

Neural Information Processing Systems 

The confidence set is updated based on the revenue feedback that is revealed after an arm is pulled. Thompson sampling (TS) assumes a prior distribution over the parameters defining the reward distribution. At each step, a parameter value is sampled from the posterior distribution, and the arm that is optimal for the sampled parameter is pulled.
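The sample-then-act loop described above can be illustrated with a minimal sketch. It is not the multinomial logit contextual algorithm of the paper: it assumes a much simpler setting with independent Bernoulli arms and uniform Beta(1, 1) priors, chosen only because the posterior update is then a pair of counters. The names `thompson_step`, `run_thompson`, and `true_means` are hypothetical and introduced here for illustration.

```python
import random


def thompson_step(successes, failures, rng):
    """Sample one mean per arm from its Beta posterior; return the best arm's index."""
    # Assumption: Beta(1, 1) prior, so the posterior is Beta(successes + 1, failures + 1).
    samples = [rng.betavariate(s + 1, f + 1) for s, f in zip(successes, failures)]
    # Pull the arm that is optimal under the sampled parameters.
    return max(range(len(samples)), key=samples.__getitem__)


def run_thompson(true_means, horizon, seed=0):
    """Simulate Thompson sampling on Bernoulli arms with the given (hidden) means."""
    rng = random.Random(seed)
    k = len(true_means)
    successes = [0] * k
    failures = [0] * k
    pulls = [0] * k
    for _ in range(horizon):
        arm = thompson_step(successes, failures, rng)
        # Feedback is revealed only for the arm that was pulled.
        reward = 1 if rng.random() < true_means[arm] else 0
        if reward:
            successes[arm] += 1
        else:
            failures[arm] += 1
        pulls[arm] += 1
    return pulls, successes, failures


if __name__ == "__main__":
    pulls, successes, failures = run_thompson([0.1, 0.9], horizon=2000)
    print("pulls per arm:", pulls)
```

In this simplified setting the posterior concentrates as feedback accumulates, so the sampled parameters increasingly favour the better arm. The contextual multinomial logit case replaces the per-arm Beta posteriors with a posterior (or an approximation of one) over a shared parameter vector.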
