van der Hoeven, Dirk
Online Newton Method for Bandit Convex Optimisation
Fokkema, Hidde, van der Hoeven, Dirk, Lattimore, Tor, Mayo, Jack J.
We introduce a computationally efficient algorithm for zeroth-order bandit convex optimisation and prove that in the adversarial setting its regret is at most $d^{3.5} \sqrt{n} \mathrm{polylog}(n, d)$ with high probability where $d$ is the dimension and $n$ is the time horizon. In the stochastic setting the bound improves to $M d^{2} \sqrt{n} \mathrm{polylog}(n, d)$ where $M \in [d^{-1/2}, d^{-1 / 4}]$ is a constant that depends on the geometry of the constraint set and the desired computational properties.
High-Probability Risk Bounds via Sequential Predictors
van der Hoeven, Dirk, Zhivotovskiy, Nikita, Cesa-Bianchi, Nicolò
Online learning methods yield sequential regret bounds under minimal assumptions and provide in-expectation risk bounds for statistical learning. However, despite the apparent advantage of online guarantees over their statistical counterparts, recent findings indicate that in many important cases, regret bounds may not guarantee tight high-probability risk bounds in the statistical setting. In this work we show that online-to-batch conversions applied to general online learning algorithms can bypass this limitation. Via a general second-order correction to the loss function defining the regret, we obtain nearly optimal high-probability risk bounds for several classical statistical estimation problems, such as discrete distribution estimation, linear regression, logistic regression, and conditional density estimation. Our analysis relies on the fact that many online learning algorithms are improper, as they are not restricted to use predictors from a given reference class. The improper nature of our estimators enables significant improvements in the dependencies on various problem parameters. Finally, we discuss some computational advantages of our sequential algorithms over their existing batch counterparts.
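For readers unfamiliar with the conversion itself, here is a minimal sketch of the classical online-to-batch scheme: run an online learner (here, scalar online gradient descent) over an i.i.d. sample and return the average iterate. The second-order loss correction central to the paper is not shown; `online_to_batch`, `grad`, and the scalar setting are illustrative assumptions, not the paper's construction.

```python
def online_to_batch(grad, x0, samples, eta=0.1):
    """Classical online-to-batch conversion (sketch).

    grad(x, z): gradient of the loss at point x on example z.
    samples: an i.i.d. sample z_1, ..., z_T.
    Returns the average of the online iterates x_1, ..., x_T, whose
    risk is controlled by the online learner's regret in expectation.
    """
    x = x0
    avg = 0.0
    for t, z in enumerate(samples, 1):
        avg += (x - avg) / t      # running average of iterates x_1..x_t
        x = x - eta * grad(x, z)  # online gradient step on example z
    return avg
```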
Trading-Off Payments and Accuracy in Online Classification with Paid Stochastic Experts
van der Hoeven, Dirk, Pike-Burke, Ciara, Qiu, Hao, Cesa-Bianchi, Nicolò
We investigate online classification in the framework of prediction with expert advice where, in each round, the learning agent predicts an unknown binary label by aggregating the stochastic predictions of a number of experts. At the end of each round, the learner observes the true label and updates the function used to aggregate experts. In the variant considered in this work, we assume that at the beginning of a round the learner allocates a payment to each expert, which affects the expert's performance in that round. This payment model of expert advice is realistic in many scenarios since human annotators will often only give useful advice if they are adequately compensated, and machine annotators may require more computation to return accurate predictions. Moreover, monetary incentives have been studied in crowdsourcing (Ho et al., 2015, 2016). Although this is a different setting from the one considered here, it is natural to study the effect of these payments in online binary classification with stochastic expert advice. Motivated by results in crowdsourcing--e.g., Ho et al. (2016)--we assume that each expert has a productivity function which determines the probability that they predict the label correctly given the payment they received. The productivity function can be different for each expert and is initially unknown to the learner. In each round, the learner pays each expert $j = 1, \ldots, K$ some amount c
Delayed Bandits: When Do Intermediate Observations Help?
Esposito, Emmanuel, Masoudian, Saeed, Qiu, Hao, van der Hoeven, Dirk, Cesa-Bianchi, Nicolò, Seldin, Yevgeny
We study a $K$-armed bandit with delayed feedback and intermediate observations. We consider a model where intermediate observations have the form of a finite state, which is observed immediately after taking an action, whereas the loss is observed after an adversarially chosen delay. We show that it is the mapping of states to losses that determines the complexity of the problem, irrespective of whether the mapping of actions to states is stochastic or adversarial. If the mapping of states to losses is adversarial, then the regret rate is of order $\sqrt{(K+d)T}$ (within log factors), where $T$ is the time horizon and $d$ is a fixed delay. This matches the regret rate of a $K$-armed bandit with delayed feedback and without intermediate observations, implying that intermediate observations are not helpful. However, if the mapping of states to losses is stochastic, we show that the regret grows at a rate of $\sqrt{\big(K+\min\{|\mathcal{S}|,d\}\big)T}$ (within log factors), implying that if the number $|\mathcal{S}|$ of states is smaller than the delay, then intermediate observations help. We also provide refined high-probability regret upper bounds for non-uniform delays, together with experimental validation of our algorithms.
A Unified Analysis of Nonstochastic Delayed Feedback for Combinatorial Semi-Bandits, Linear Bandits, and MDPs
van der Hoeven, Dirk, Zierahn, Lukas, Lancewicki, Tal, Rosenberg, Aviv, Cesa-Bianchi, Nicolò
We derive a new analysis of Follow The Regularized Leader (FTRL) for online learning with delayed bandit feedback. By separating the cost of delayed feedback from that of bandit feedback, our analysis allows us to obtain new results in three important settings. On the one hand, we derive the first optimal (up to logarithmic factors) regret bounds for combinatorial semi-bandits with delay and adversarial Markov decision processes with delay (and known transition functions). On the other hand, we use our analysis to derive an efficient algorithm for linear bandits with delay achieving near-optimal regret bounds. Our novel regret decomposition shows that FTRL remains stable across multiple rounds under mild assumptions on the Hessian of the regularizer.
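As a rough illustration of the delayed-feedback setup, the following toy sketch runs FTRL with an entropic regularizer (equivalently, exponential weights over the simplex) using only the losses whose feedback has arrived. It assumes full-information losses and omits the bandit importance weighting and the Hessian-based stability analysis from the paper; `ftrl_delayed` and its arguments are illustrative names.

```python
import math

def ftrl_delayed(loss_seq, delays, K, eta=0.1):
    """FTRL with entropic regularizer under delayed feedback (toy sketch).

    loss_seq[t][k]: loss of arm k in round t.
    delays[t]: delay before round t's feedback arrives.
    The distribution played in round t uses only losses whose feedback
    has already been observed (rounds s with s + delays[s] < t).
    """
    T = len(loss_seq)
    plays = []
    for t in range(T):
        obs = [0.0] * K  # cumulative *observed* loss per arm
        for s in range(t):
            if s + delays[s] < t:
                for k in range(K):
                    obs[k] += loss_seq[s][k]
        # entropic-regularized FTRL = exponential weights on observed losses
        w = [math.exp(-eta * o) for o in obs]
        z = sum(w)
        plays.append([wk / z for wk in w])
    return plays
```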
A Regret-Variance Trade-Off in Online Learning
van der Hoeven, Dirk, Zhivotovskiy, Nikita, Cesa-Bianchi, Nicolò
We consider prediction with expert advice for strongly convex and bounded losses, and investigate trade-offs between regret and "variance" (i.e., the squared difference between the learner's predictions and the best expert's predictions). With $K$ experts, the Exponentially Weighted Average (EWA) algorithm is known to achieve $O(\log K)$ regret. We prove that a variant of EWA either achieves negative regret (i.e., the algorithm outperforms the best expert), or guarantees an $O(\log K)$ bound on both variance and regret. Building on this result, we show several examples of how the variance of predictions can be exploited in learning. In the online-to-batch analysis, we show that a large empirical variance allows us to stop the online-to-batch conversion early and outperform the risk of the best predictor in the class. We also recover the optimal rate of model selection aggregation when we do not consider early stopping. In online prediction with corrupted losses, we show that the effect of corruption on the regret can be compensated by a large variance. In online selective sampling, we design an algorithm that samples less when the variance is large, while guaranteeing the optimal regret bound in expectation. In online learning with abstention, we use a term similar to the variance to derive the first high-probability $O(\log K)$ regret bound in this setting. Finally, we extend our results to the setting of online linear regression.
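A minimal sketch of the standard EWA forecaster referenced above (not the paper's variant), for scalar predictions and assuming bounded losses; `ewa_predictions` and its arguments are illustrative names.

```python
import math

def ewa_predictions(expert_preds, losses, eta=0.5):
    """Exponentially Weighted Average (EWA) forecaster (sketch).

    expert_preds[t][k]: prediction of expert k in round t.
    losses[t][k]: loss of expert k in round t (assumed bounded).
    Returns the learner's weighted-average prediction for each round.
    """
    K = len(expert_preds[0])
    cum_loss = [0.0] * K  # cumulative loss per expert
    out = []
    for preds, ls in zip(expert_preds, losses):
        # weights proportional to exp(-eta * cumulative loss so far)
        w = [math.exp(-eta * c) for c in cum_loss]
        z = sum(w)
        out.append(sum(wk * pk for wk, pk in zip(w, preds)) / z)
        for k in range(K):
            cum_loss[k] += ls[k]
    return out
```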
Beyond Bandit Feedback in Online Multiclass Classification
van der Hoeven, Dirk, Fusco, Federico, Cesa-Bianchi, Nicolò
We study the problem of online multiclass classification in a setting where the learner's feedback is determined by an arbitrary directed graph. While including bandit feedback as a special case, feedback graphs allow a much richer set of applications, including filtering and label efficient classification. We introduce Gappletron, the first online multiclass algorithm that works with arbitrary feedback graphs. For this new algorithm, we prove surrogate regret bounds that hold, both in expectation and with high probability, for a large class of surrogate losses. Our bounds are of order $B\sqrt{\rho KT}$, where $B$ is the diameter of the prediction space, $K$ is the number of classes, $T$ is the time horizon, and $\rho$ is the domination number (a graph-theoretic parameter affecting the amount of exploration). In the full information case, we show that Gappletron achieves a constant surrogate regret of order $B^2K$. We also prove a general lower bound of order $\max\big\{B^2K,\sqrt{T}\big\}$ showing that our upper bounds are not significantly improvable. Experiments on synthetic data show that for various feedback graphs, our algorithm is competitive against known baselines.
Distributed Online Learning for Joint Regret with Communication Constraints
van der Hoeven, Dirk, Hadiji, Hédi, van Erven, Tim
We consider a decentralized online convex optimization (OCO) setting with multiple agents that share information across a network to improve the prediction quality of the network as a whole. Our motivation comes from cases where local computation is cheap, but communication is relatively expensive. This is the case, for instance, in sensor networks, where the energy cost of wireless communication is typically the main bottleneck, and long-distance communication requires much more energy than communication between nearby sensors (Rabbat, Nowak, 2004). It also applies to cases where communication is relatively slow compared to the volume of prediction requests that each agent must serve. For instance, in climate informatics communication may be slow because agents are geographically spread out (McQuade, Monteleoni, 2012, 2017), and in finance or online advertising the rate of prediction requests may be so high that communication is slow by comparison. To model such scenarios, we limit communication in two ways: first, agents can only directly communicate to their neighbors in a communication graph G and, second, the messages that the agents can send are limited to contain at most b bits. We further assume that learning is fully decentralized, so there is no central coordinating agent as in federated learning (Kairouz et al., 2019), and no single agent that dictates the predictions for all other agents as in distributed online optimization for consensus problems (Hosseini et al., 2013; Yan et al., 2013).
MetaGrad: Adaptation using Multiple Learning Rates in Online Learning
van Erven, Tim, Koolen, Wouter M., van der Hoeven, Dirk
We provide a new adaptive method for online convex optimization, MetaGrad, that is robust to general convex losses but achieves faster rates for a broad class of special functions, including exp-concave and strongly convex functions, as well as various types of stochastic and non-stochastic functions without any curvature. We prove this by drawing a connection to the Bernstein condition, which is known to imply fast rates in offline statistical learning. MetaGrad further adapts automatically to the size of the gradients. Its main feature is that it simultaneously considers multiple learning rates, which are weighted in direct proportion to their empirical performance on the data using a new meta-algorithm. We provide three versions of MetaGrad. The full matrix version maintains a full covariance matrix and is applicable to learning tasks for which we can afford update time quadratic in the dimension. The other two versions provide speed-ups for high-dimensional learning tasks with an update time that is linear in the dimension: one is based on sketching, the other on running a separate copy of the basic algorithm per coordinate. We evaluate all versions of MetaGrad on benchmark online classification and regression tasks, on which they consistently outperform both online gradient descent and AdaGrad.
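A heavily simplified sketch of the multiple-learning-rate idea: run one slave learner per learning rate and tilt the meta-weights by a second-order surrogate of each slave's instantaneous regret. Assumptions not in the abstract: scalar iterates, and plain gradient-descent slaves in place of MetaGrad's second-order slave update; all names are illustrative.

```python
import math

def multi_lr_master(grad, x0, rounds, etas):
    """Toy sketch of a multiple-learning-rate master (MetaGrad-style).

    grad(x, t): gradient of the round-t loss at point x.
    etas: grid of candidate learning rates, one slave per eta.
    The master plays an eta-weighted mixture of the slaves; meta-weights
    are tilted by exp(eta*r - (eta*r)^2), where r is the instantaneous
    regret of the master against the slave.
    """
    slaves = {eta: x0 for eta in etas}
    pi = {eta: 1.0 / len(etas) for eta in etas}  # meta-weights
    history = []
    for t in range(rounds):
        # master plays an eta-weighted combination of the slaves
        z = sum(eta * pi[eta] for eta in etas)
        w = sum(eta * pi[eta] * slaves[eta] for eta in etas) / z
        g = grad(w, t)
        history.append(w)
        for eta in etas:
            r = g * (w - slaves[eta])  # instantaneous regret vs slave
            pi[eta] *= math.exp(eta * r - (eta * r) ** 2)  # tilted update
            slaves[eta] -= eta * g     # simplified OGD slave step
        s = sum(pi.values())
        pi = {e: p / s for e, p in pi.items()}  # renormalize meta-weights
    return history
```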
Exploiting the Surrogate Gap in Online Multiclass Classification
van der Hoeven, Dirk
We present Gaptron, a randomized first-order algorithm for online multiclass classification. In the full information setting we show expected mistake bounds with respect to the logistic loss, hinge loss, and the smooth hinge loss with constant regret, where the expectation is with respect to the learner's randomness. In the bandit classification setting we show that Gaptron is the first linear time algorithm with $O(K\sqrt{T})$ expected regret, where $K$ is the number of classes. Additionally, the expected mistake bound of Gaptron does not depend on the dimension of the feature vector, contrary to previous algorithms with $O(K\sqrt{T})$ regret in the bandit classification setting. We present a new proof technique that exploits the gap between the zero-one loss and surrogate losses rather than exploiting properties such as exp-concavity or mixability, which are traditionally used to prove logarithmic or constant regret bounds.