 cab problem


From Finite to Countable-Armed Bandits

Neural Information Processing Systems

In addition, there is a fixed distribution over types which sets the proportion of each type in the population of arms. The decision maker is oblivious to the type of any arm and to the aforementioned distribution over types, but perfectly knows the total number of types occurring in the population of arms.



From Finite to Countable-Armed Bandits

arXiv.org Machine Learning

We consider a stochastic bandit problem with countably many arms that belong to a finite set of types, each characterized by a unique mean reward. In addition, there is a fixed distribution over types which sets the proportion of each type in the population of arms. The decision maker is oblivious to the type of any arm and to the aforementioned distribution over types, but perfectly knows the total number of types occurring in the population of arms. We propose a fully adaptive online learning algorithm that achieves O(log n) distribution-dependent expected cumulative regret after any number of plays n, and show that this order of regret is best possible. The analysis of our algorithm relies on newly discovered concentration and convergence properties of optimism-based policies like UCB in finite-armed bandit problems with "zero gap," which may be of independent interest.
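As a rough illustration of the setting described in this abstract, the sketch below draws a handful of arms from a population whose hidden types follow a fixed distribution, then runs textbook UCB1 on them. This is not the paper's algorithm: the function names, the Bernoulli reward model, and the choice to sample only `k` arms up front are assumptions made for illustration, and the regret is measured against the best sampled arm rather than the best type.

```python
import math
import random


def draw_arm_means(type_means, type_probs, k, rng):
    """Sample k arms from the population; each arm's mean is set by its hidden type."""
    return rng.choices(type_means, weights=type_probs, k=k)


def ucb1(arm_means, horizon, rng):
    """Run textbook UCB1 with Bernoulli rewards (an assumed reward model).

    Returns the per-arm pull counts and the cumulative pseudo-regret
    relative to the best arm in arm_means.
    """
    k = len(arm_means)
    counts = [0] * k
    sums = [0.0] * k
    best = max(arm_means)
    regret = 0.0
    for t in range(1, horizon + 1):
        if t <= k:
            arm = t - 1  # play every arm once to initialise the estimates
        else:
            # Optimistic index: empirical mean plus a confidence bonus.
            arm = max(
                range(k),
                key=lambda i: sums[i] / counts[i]
                + math.sqrt(2.0 * math.log(t) / counts[i]),
            )
        reward = 1.0 if rng.random() < arm_means[arm] else 0.0
        counts[arm] += 1
        sums[arm] += reward
        regret += best - arm_means[arm]
    return counts, regret
```

For example, `ucb1(draw_arm_means([0.3, 0.7], [0.5, 0.5], 4, rng), 2000, rng)` with `rng = random.Random(0)` simulates two types in equal proportion. When several sampled arms share the best type, the gap between them is zero, which is the "zero gap" situation the abstract refers to.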


Exploring Offline Policy Evaluation for the Continuous-Armed Bandit Problem

arXiv.org Machine Learning

In the canonical multi-armed bandit (MAB) problem, a gambler stands in front of a row of slot machines, each with a (potentially) different payoff. The gambler must decide in sequence which machine to play and, over the course of play, she aims to make as much profit as possible by simultaneously learning from previous observations and using the gained knowledge to steer future actions (Berry and Fristedt, 1985; Whittle, 1980). The gambler needs to pick a strategy that dictates which arm to play next given the previous observations. Finding such a strategy is complicated because at each interaction the gambler observes only the outcome of the machine she played, and she will never know the outcomes of the other possible courses of action at that moment in time. This so-called omission of counterfactuals (Li, Chu, Langford, and Wang, 2011), that is, not being able to gain knowledge about all the possible outcomes, gives rise to the exploration versus exploitation tradeoff (Berry and Fristedt, 1985): at each time point the gambler's action can either be geared toward gaining more knowledge about the machines she is uncertain about (exploration), or toward using the knowledge gained in earlier interactions by playing machines with a high expected payoff (exploitation).
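The exploration versus exploitation tradeoff described above can be made concrete with a minimal epsilon-greedy sketch. This strategy is a common textbook baseline and is not taken from the paper; the function name, the Bernoulli payoffs, and the `epsilon` parameter are assumptions chosen for illustration.

```python
import random


def epsilon_greedy(payoffs, rounds, epsilon=0.1, seed=0):
    """Play `rounds` pulls on Bernoulli machines with success probabilities `payoffs`.

    With probability epsilon a random machine is tried (exploration);
    otherwise the machine with the highest estimated payoff is played
    (exploitation). Only the played machine's outcome is observed.
    """
    rng = random.Random(seed)
    k = len(payoffs)
    counts = [0] * k
    estimates = [0.0] * k
    total = 0.0
    for _ in range(rounds):
        if rng.random() < epsilon:
            arm = rng.randrange(k)  # explore: pick any machine
        else:
            arm = max(range(k), key=lambda i: estimates[i])  # exploit
        reward = 1.0 if rng.random() < payoffs[arm] else 0.0
        counts[arm] += 1
        # Incremental update of the running mean for the played machine.
        estimates[arm] += (reward - estimates[arm]) / counts[arm]
        total += reward
    return counts, estimates, total
```

Setting `epsilon=0.0` shows why some exploration is needed: with `payoffs=[0.0, 1.0]`, the gambler keeps playing the first machine forever, because she never observes what the second machine would have paid.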