Goto

Collaborating Authors

 exponential weight


Learning the Learning Rate for Prediction with Expert Advice

Neural Information Processing Systems

Most standard algorithms for prediction with expert advice depend on a parameter called the learning rate. This learning rate needs to be large enough to fit the data well, but small enough to prevent overfitting. For the exponential weights algorithm, a sequence of prior work has established theoretical guarantees for higher and higher data-dependent tunings of the learning rate, which allow for increasingly aggressive learning. But in practice such theoretical tunings often still perform worse (as measured by their regret) than ad hoc tuning with an even higher learning rate. To close the gap between theory and practice we introduce an approach to learn the learning rate. Up to a factor that is at most (poly)logarithmic in the number of experts and the inverse of the learning rate, our method performs as well as if we would know the empirically best learning rate from a large range that includes both conservative small values and values that are much higher than those for which formal guarantees were previously available. Our method employs a grid of learning rates, yet runs in linear time regardless of the size of the grid.


Learning the Learning Rate for Prediction with Expert Advice

Neural Information Processing Systems

Most standard algorithms for prediction with expert advice depend on a parameter called the learning rate. This learning rate needs to be large enough to fit the data well, but small enough to prevent overfitting. For the exponential weights algorithm, a sequence of prior work has established theoretical guarantees for higher and higher data-dependent tunings of the learning rate, which allow for increasingly aggressive learning. But in practice such theoretical tunings often still perform worse (as measured by their regret) than ad hoc tuning with an even higher learning rate. To close the gap between theory and practice we introduce an approach to learn the learning rate. Up to a factor that is at most (poly)logarithmic in the number of experts and the inverse of the learning rate, our method performs as well as if we would know the empirically best learning rate from a large range that includes both conservative small values and values that are much higher than those for which formal guarantees were previously available. Our method employs a grid of learning rates, yet runs in linear time regardless of the size of the grid.


Fast Routing under Uncertainty: Adaptive Learning in Congestion Games via Exponential Weights

Neural Information Processing Systems

We examine an adaptive learning framework for nonatomic congestion games where the players' cost functions may be subject to exogenous fluctuations (e.g., due to disturbances in the network, variations in the traffic going through a link). In this setting, the popular multiplicative/ exponential weights algorithm enjoys an \mathcal{O}(1/\sqrt{T}) equilibrium convergence rate; however, this rate is suboptimal in static environments---i.e., when the network is not subject to randomness. In this static regime, accelerated algorithms achieve an \mathcal{O}(1/T {2}) convergence speed, but they fail to converge altogether in stochastic problems. To fill this gap, we propose a novel, adaptive exponential weights method---dubbed AdaWeight---that seamlessly interpolates between the \mathcal{O}(1/T {2}) and \mathcal{O}(1/\sqrt{T}) rates in the static and stochastic regimes respectively. Importantly, this "best-of-both-worlds" guarantee does not require any prior knowledge of the problem's parameters or tuning by the optimizer; in addition, the method's convergence speed depends subquadratically on the size of the network (number of vertices and edges), so it scales gracefully to large, real-life urban networks.


Optimization, Learning, and Games with Predictable Sequences

Neural Information Processing Systems

We provide several applications of Optimistic Mirror Descent, an online learning algorithm based on the idea of predictable sequences. First, we recover the Mirror Prox algorithm for offline optimization, prove an extension to Hölder-smooth functions, and apply the results to saddle-point type problems. Next, we prove that a version of Optimistic Mirror Descent (which has a close relation to the Exponential Weights algorithm) can be used by two strongly-uncoupled players in a finite zero-sum matrix game to converge to the minimax equilibrium at the rate of O((log T) T).


Learning the Learning Rate for Prediction with Expert Advice

Neural Information Processing Systems

Most standard algorithms for prediction with expert advice depend on a parameter called the learning rate. This learning rate needs to be large enough to fit the data well, but small enough to prevent overfitting. For the exponential weights algorithm, a sequence of prior work has established theoretical guarantees for higher and higher data-dependent tunings of the learning rate, which allow for increasingly aggressive learning. But in practice such theoretical tunings often still perform worse (as measured by their regret) than ad hoc tuning with an even higher learning rate. To close the gap between theory and practice we introduce an approach to learn the learning rate. Up to a factor that is at most (poly)logarithmic in the number of experts and the inverse of the learning rate, our method performs as well as if we would know the empirically best learning rate from a large range that includes both conservative small values and values that are much higher than those for which formal guarantees were previously available. Our method employs a grid of learning rates, yet runs in linear time regardless of the size of the grid.


Exponential weight averaging as damped harmonic motion

Patsenker, Jonathan, Li, Henry, Kluger, Yuval

arXiv.org Artificial Intelligence

The exponential moving average (EMA) is a commonly used statistic for providing stable estimates of stochastic quantities in deep learning optimization. Recently, EMA has seen considerable use in generative models, where it is computed with respect to the model weights, and significantly improves the stability of the inference model during and after training. While the practice of weight averaging at the end of training is well-studied and known to improve estimates of local optima, the benefits of EMA over the course of training is less understood. In this paper, we derive an explicit connection between EMA and a damped harmonic system between two particles, where one particle (the EMA weights) is drawn to the other (the model weights) via an idealized zero-length spring. We then leverage this physical analogy to analyze the effectiveness of EMA, and propose an improved training algorithm, which we call BELAY. Finally, we demonstrate theoretically and empirically several advantages enjoyed by BELAY over standard EMA.


Boosting Nystr\"{o}m Method

Hamm, Keaton, Lu, Zhaoying, Ouyang, Wenbo, Zhang, Hao Helen

arXiv.org Artificial Intelligence

The Nystr\"{o}m method is an effective tool to generate low-rank approximations of large matrices, and it is particularly useful for kernel-based learning. To improve the standard Nystr\"{o}m approximation, ensemble Nystr\"{o}m algorithms compute a mixture of Nystr\"{o}m approximations which are generated independently based on column resampling. We propose a new family of algorithms, boosting Nystr\"{o}m, which iteratively generate multiple ``weak'' Nystr\"{o}m approximations (each using a small number of columns) in a sequence adaptively - each approximation aims to compensate for the weaknesses of its predecessor - and then combine them to form one strong approximation. We demonstrate that our boosting Nystr\"{o}m algorithms can yield more efficient and accurate low-rank approximations to kernel matrices. Improvements over the standard and ensemble Nystr\"{o}m methods are illustrated by simulation studies and real-world data analysis.


WildWood: a new Random Forest algorithm

Gaïffas, Stéphane, Merad, Ibrahim, Yu, Yiyang

arXiv.org Machine Learning

We introduce WildWood (WW), a new ensemble algorithm for supervised learning of Random Forest (RF) type. While standard RF algorithms use bootstrap out-of-bag samples to compute out-of-bag scores, WW uses these samples to produce improved predictions given by an aggregation of the predictions of all possible subtrees of each fully grown tree in the forest. This is achieved by aggregation with exponential weights computed over out-of-bag samples, that are computed exactly and very efficiently thanks to an algorithm called context tree weighting. This improvement, combined with a histogram strategy to accelerate split finding, makes WW fast and competitive compared with other well-established ensemble methods, such as standard RF and extreme gradient boosting algorithms.


Exploration by Optimisation in Partial Monitoring

Lattimore, Tor, Szepesvari, Csaba

arXiv.org Machine Learning

We provide a simple and efficient algorithm for adversarial $k$-action $d$-outcome non-degenerate locally observable partial monitoring games for which the $n$-round minimax regret is bounded by $3(d+1) k^{3/2} \sqrt{8n \log(k)}$, matching the best known information-theoretic upper bounds.


Exponential Weights on the Hypercube in Polynomial Time

Putta, Sudeep Raja

arXiv.org Machine Learning

We study a general online linear optimization problem(OLO). At each round, a subset of objects from a fixed universe of $n$ objects is chosen, and a linear cost associated with the chosen subset is incurred. We use \textit{regret} as a measure of performance of our algorithms. Regret is the difference between the total cost incurred over all iterations and the cost of the best fixed subset in hindsight. We consider \textit{Full Information}, \textit{Semi-Bandit} and \textit{Bandit} feedback for this problem. Using characteristic vectors of the subsets, this problem reduces to OLO on the $\{0,1\}^n$ hypercube. The Exp2 algorithm and its bandit variants are commonly used strategies for this problem. It was previously unknown if it is possible to run Exp2 on the hypercube in polynomial time. In this paper, we present a polynomial time algorithm called \textit{PolyExp} for OLO on the hypercube. We show that our algorithm is equivalent to both Exp2 on $\{0,1\}^n$ as well as Online Mirror Descent(OMD) with Entropic regularization on $[0,1]^n$ and Bernoulli Sampling. Under $L_\infty$ adversarial losses, in the Full Information case and Semi-Bandit case, analyzing Exp2 directly, gives an expected regret bound of $O(n^{3/2}\sqrt{T})$, whereas PolyExp yields a regret of $O(n\sqrt{T})$. In the Bandit case, analyzing Exp2 directly, gives an expected regret bound of $O(n^{2}\sqrt{T})$, whereas PolyExp yields a regret of $O(n^{3/2}\sqrt{T})$. This implies an improvement on Exp2's regret bound for these settings because of the equivalence. Moreover, PolyExp is minimax optimal in all the three settings as its regret bounds match the $L_\infty$ lowerbounds in \cite{audibert2011minimax}. Finally, we show how to use PolyExp on the $\{-1,+1\}^n$ hypercube, solving an open problem in \cite{bubeck2012towards}.