Cowan, Wesley, Katehakis, Michael N.

We consider the \mnk{classical} problem of a controller activating (or sampling) sequentially from a finite number of $N \geq 2$ populations, specified by unknown distributions. Over some time horizon, at each time $n = 1, 2, \ldots$, the controller wishes to select a population to sample, with the goal of sampling from a population that optimizes some "score" function of its distribution, e.g., maximizing the expected sum of outcomes or minimizing variability. We define a class of \textit{Uniformly Fast (UF)} sampling policies and show, under mild regularity conditions, that there is an asymptotic lower bound for the expected total number of sub-optimal population activations. Then, we provide sufficient conditions under which a UCB policy is UF and asymptotically optimal, since it attains this lower bound. Explicit solutions are provided for a number of examples of interest, including general score functionals on unconstrained Pareto distributions (of potentially infinite mean), and uniform distributions of unknown support. Additional results on bandits of Normal distributions are also provided.

Cesa-Bianchi, Nicolò, Kakade, Sham

We provide the first algorithm for online bandit linear optimization whose regret after T rounds is of order sqrt{Td ln N} on any finite class X of N actions in d dimensions, and of order d*sqrt{T} (up to log factors) when X is infinite. These bounds are not improvable in general. The basic idea utilizes tools from convex geometry to construct what is essentially an optimal exploration basis. We also present an application to a model of linear bandits with expert advice. Interestingly, these results show that bandit linear optimization with expert advice in d dimensions is no more difficult (in terms of the achievable regret) than the online d-armed bandit problem with expert advice (where EXP4 is optimal).

Duff, Michael O., Barto, Andrew G.

A Bayesian formulation of the problem leads to a clear concept of a solution whose computation, however, appears to entail an examination of an intractably-large number of hyperstates. This paper hassuggested extending the Gittins index approach (which applies with great power and elegance to the special class of multi-armed bandit processes) to general adaptive MDP's. The hope has been that if certain salient features of the value of information could be captured, even approximately, then one could be led to a reasonable method for avoiding certain defects of certainty-equivalence approaches (problems with identifiability, "metastability"). Obviously, positive evidence, in the form of empirical results from simulation experiments, would lend support to these ideas-work along these lines is underway. Local bandit approximation is but one approximate computational approach for problems of optimal learning and dual control. Most prominent in the literature of control theory is the "wide-sense" approach of [Bar-Shalom & Tse, 1976], which utilizes localquadratic approximations about nominal state/control trajectories. For certain problems, this method has demonstrated superior performance compared to a certainty-equivalence approach, but it is computationally very intensive and unwieldy, particularly for problems with controller dimension greater than one. One could revert to the view of the bandit problem, or general adaptive MDP, as simply a very large MDP defined over hyperstates, and then consider a some- Local Bandit Approximationfor Optimal Learning Problems 1025 what direct approach in which one performs approximate dynamic programming with function approximation over this domain-details of function-approximation, feature-selection, and "training" all become important design issues.

Combes, Richard, Magureanu, Stefan, Proutiere, Alexandre

This paper introduces and addresses a wide class of stochastic bandit problems where the function mapping the arm to the corresponding reward exhibits some known structural properties. Most existing structures (e.g. linear, lipschitz, unimodal, combinatorial, dueling,...) are covered by our framework. We derive an asymptotic instance-specific regret lower bound for these problems, and develop OSSB, an algorithm whose regret matches this fundamental limit. OSSB is not based on the classical principle of ``optimism in the face of uncertainty'' or on Thompson sampling, and rather aims at matching the minimal exploration rates of sub-optimal arms as characterized in the derivation of the regret lower bound. We illustrate the efficiency of OSSB using numerical experiments in the case of the linear bandit problem and show that OSSB outperforms existing algorithms, including Thompson sampling

Tao, Chao, Blanco, Saúl, Peng, Jian, Zhou, Yuan

We consider the thresholding bandit problem, whose goal is to find arms of mean rewards above a given threshold $\theta$, with a fixed budget of $T$ trials. We introduce LSA, a new, simple and anytime algorithm that aims to minimize the aggregate regret (or the expected number of mis-classified arms). We prove that our algorithm is instance-wise asymptotically optimal. We also provide comprehensive empirical results to demonstrate the algorithm's superior performance over existing algorithms under a variety of different scenarios. Papers published at the Neural Information Processing Systems Conference.