Pacchiano, Aldo, Chatterji, Niladri S., Bartlett, Peter L.

We present a generalization of the adversarial linear bandits framework, where the underlying losses are kernel functions (with an associated reproducing kernel Hilbert space) rather than linear functions. We study a version of the exponential weights algorithm and bound its regret in this setting. Under conditions on the eigendecay of the kernel we provide a sharp characterization of the regret for this algorithm. When we have polynomial eigendecay $\mu_j \le \mathcal{O}(j^{-\beta})$, we find that the regret is bounded by $\mathcal{R}_n \le \mathcal{O}(n^{\beta/(2(\beta-1))})$; while under the assumption of exponential eigendecay $\mu_j \le \mathcal{O}(e^{-\beta j })$, we get an even tighter bound on the regret $\mathcal{R}_n \le \mathcal{O}(n^{1/2}\log(n)^{1/2})$. We also study the full information setting when the underlying losses are kernel functions and present an adapted exponential weights algorithm and a conditional gradient descent algorithm.

Abernethy, Jacob D., Lee, Chansoo, Tewari, Ambuj

We focus on the adversarial multi-armed bandit problem. The EXP3 algorithm of Auer et al. (2003) was shown to have a regret bound of $O(\sqrt{T N \log N})$, where $T$ is the time horizon and $N$ is the number of available actions (arms). More recently, Audibert and Bubeck (2009) improved the bound by a logarithmic factor via an entirely different method. In the present work, we provide a new set of analysis tools, using the notion of convex smoothing, to provide several novel algorithms with optimal guarantees. First we show that regularization via the Tsallis entropy matches the minimax rate of Audibert and Bubeck (2009) with an even tighter constant; it also fully generalizes EXP3. Second we show that a wide class of perturbation methods lead to near-optimal bandit algorithms as long as a simple condition on the perturbation distribution $\mathcal{D}$ is met: one needs that the hazard function of $\mathcal{D}$ remain bounded. The Gumbel, Weibull, Frechet, Pareto, and Gamma distributions all satisfy this key property; interestingly, the Gaussian and Uniform distributions do not.

Kotłowski, Wojciech, Warmuth, Manfred K.

Most of machine learning deals with vector parameters. Ideally we would like to take higher order information into account and make use of matrix or even tensor parameters. However the resulting algorithms are usually inefficient. Here we address on-line learning with matrix parameters. It is often easy to obtain online algorithm with good generalization performance if you eigendecompose the current parameter matrix in each trial (at a cost of $O(n^3)$ per trial). Ideally we want to avoid the decompositions and spend $O(n^2)$ per trial, i.e. linear time in the size of the matrix data. There is a core trade-off between the running time and the generalization performance, here measured by the regret of the on-line algorithm (total gain of the best off-line predictor minus the total gain of the on-line algorithm). We focus on the key matrix problem of rank $k$ Principal Component Analysis in $\mathbb{R}^n$ where $k \ll n$. There are $O(n^3)$ algorithms that achieve the optimum regret but require eigendecompositions. We develop a simple algorithm that needs $O(kn^2)$ per trial whose regret is off by a small factor of $O(n^{1/4})$. The algorithm is based on the Follow the Perturbed Leader paradigm. It replaces full eigendecompositions at each trial by the problem finding $k$ principal components of the current covariance matrix that is perturbed by Gaussian noise.

This work addresses the problem of regret minimization in non-stochastic multi-armed bandit problems, focusing on performance guarantees that hold with high probability. Such results are rather scarce in the literature since proving them requires a large deal of technical effort and significant modifications to the standard, more intuitive algorithms that come only with guarantees that hold on expectation. One of these modifications is forcing the learner to sample arms from the uniform distribution at least $\Omega(\sqrt{T})$ times over $T$ rounds, which can adversely affect performance if many of the arms are suboptimal. While it is widely conjectured that this property is essential for proving high-probability regret bounds, we show in this paper that it is possible to achieve such strong results without this undesirable exploration component. Our result relies on a simple and intuitive loss-estimation strategy called Implicit eXploration (IX) that allows a remarkably clean analysis. To demonstrate the flexibility of our technique, we derive several improved high-probability bounds for various extensions of the standard multi-armed bandit framework.Finally, we conduct a simple experiment that illustrates the robustness of our implicit exploration technique.

Lykouris, Thodoris, Mirrokni, Vahab, Leme, Renato Paes

We study "adversarial scaling", a multi-armed bandit model where rewards have a stochastic and an adversarial component. Our model captures display advertising where the "click-through-rate" can be decomposed to a (fixed across time) arm-quality component and a non-stochastic user-relevance component (fixed across arms). Despite the relative stochasticity of our model, we demonstrate two settings where most bandit algorithms suffer. On the positive side, we show that two algorithms, one from the action elimination and one from the mirror descent family are adaptive enough to be robust to adversarial scaling. Our results shed light on the robustness of adaptive parameter selection in stochastic bandits, which may be of independent interest.