Schmidt, Mark
SLANG: Fast Structured Covariance Approximations for Bayesian Deep Learning with Natural Gradient
Mishkin, Aaron, Kunstner, Frederik, Nielsen, Didrik, Schmidt, Mark, Khan, Mohammad Emtiyaz
Uncertainty estimation in large deep-learning models is a computationally challenging task, where it is difficult to form even a Gaussian approximation to the posterior distribution. In such situations, existing methods usually resort to a diagonal approximation of the covariance matrix despite the fact that these matrices are known to result in poor uncertainty estimates. To address this issue, we propose a new stochastic, low-rank, approximate natural-gradient (SLANG) method for variational inference in large, deep models. Our method estimates a "diagonal plus low-rank" structure based solely on back-propagated gradients of the network log-likelihood. This requires strictly less gradient computations than methods that compute the gradient of the whole variational objective. Empirical evaluations on standard benchmarks confirm that SLANG enables faster and more accurate estimation of uncertainty than mean-field methods, and performs comparably to state-of-the-art methods.
SLANG: Fast Structured Covariance Approximations for Bayesian Deep Learning with Natural Gradient
Mishkin, Aaron, Kunstner, Frederik, Nielsen, Didrik, Schmidt, Mark, Khan, Mohammad Emtiyaz
Uncertainty estimation in large deep-learning models is a computationally challenging task, where it is difficult to form even a Gaussian approximation to the posterior distribution. In such situations, existing methods usually resort to a diagonal approximation of the covariance matrix despite the fact that these matrices are known to give poor uncertainty estimates. To address this issue, we propose a new stochastic, low-rank, approximate natural-gradient (SLANG) method for variational inference in large deep models. Our method estimates a โdiagonal plus low-rankโ structure based solely on back-propagated gradients of the network log-likelihood. This requires strictly less gradient computations than methods that compute the gradient of the whole variational objective. Empirical evaluations on standard benchmarks confirm that SLANG enables faster and more accurate estimation of uncertainty than mean-field methods, and performs comparably to state-of-the-art methods.
SLANG: Fast Structured Covariance Approximations for Bayesian Deep Learning with Natural Gradient
Mishkin, Aaron, Kunstner, Frederik, Nielsen, Didrik, Schmidt, Mark, Khan, Mohammad Emtiyaz
Uncertainty estimation in large deep-learning models is a computationally challenging task, where it is difficult to form even a Gaussian approximation to the posterior distribution. In such situations, existing methods usually resort to a diagonal approximation of the covariance matrix despite, the fact that these matrices are known to give poor uncertainty estimates. To address this issue, we propose a new stochastic, low-rank, approximate natural-gradient (SLANG) method for variational inference in large, deep models. Our method estimates a "diagonal plus low-rank" structure based solely on back-propagated gradients of the network log-likelihood. This requires strictly less gradient computations than methods that compute the gradient of the whole variational objective. Empirical evaluations on standard benchmarks confirm that SLANG enables faster and more accurate estimation of uncertainty than mean-field methods, and performs comparably to state-of-the-art methods.
Fast and Faster Convergence of SGD for Over-Parameterized Models and an Accelerated Perceptron
Vaswani, Sharan, Bach, Francis, Schmidt, Mark
Modern machine learning focuses on highly expressive models that are able to fit or interpolate the data completely, resulting in zero training loss. For such models, we show that the stochastic gradients of common loss functions satisfy a strong growth condition. Under this condition, we prove that constant step-size stochastic gradient descent (SGD) with Nesterov acceleration matches the convergence rate of the deterministic setting for both convex and strongly-convex functions. In the non-convex setting, this condition implies that SGD can find a first-order stationary point as efficiently as full gradient descent. Under interpolation, we also show that all smooth loss functions with a finite-sum structure satisfy a weaker growth condition. Given this weaker condition, we prove that SGD with a constant step-size attains the deterministic convergence rate in both the strongly-convex and convex settings. Under additional assumptions, the above results enable us to prove an O(1/k^2) mistake bound for $k$ iterations of a stochastic perceptron algorithm using the squared-hinge loss. Finally, we validate our theoretical findings with experiments on synthetic and real datasets.
Combining Bayesian Optimization and Lipschitz Optimization
Ahmed, Mohamed Osama, Vaswani, Sharan, Schmidt, Mark
Bayesian optimization and Lipschitz optimization have developed alternative techniques for optimizing black-box functions. They each exploit a different form of prior about the function. In this work, we explore strategies to combine these techniques for better global optimization. In particular, we propose ways to use the Lipschitz continuity assumption within traditional BO algorithms, which we call Lipschitz Bayesian optimization (LBO). This approach does not increase the asymptotic runtime and in some cases drastically improves the performance (while in the worst case the performance is similar). Indeed, in a particular setting, we prove that using the Lipschitz information yields the same or a better bound on the regret compared to using Bayesian optimization on its own. Moreover, we propose a simple heuristics to estimate the Lipschitz constant, and prove that a growing estimate of the Lipschitz constant is in some sense "harmless". Our experiments on 15 datasets with 4 acquisition functions show that in the worst case LBO performs similar to the underlying BO method while in some cases it performs substantially better. Thompson sampling in particular typically saw drastic improvements (as the Lipschitz information corrected for it's well-known "over-exploration" phenomenon) and its LBO variant often outperformed other acquisition functions.
Does Your Model Know the Digit 6 Is Not a Cat? A Less Biased Evaluation of "Outlier" Detectors
Shafaei, Alireza, Schmidt, Mark, Little, James J.
In the real world, a learning system could receive an input that looks nothing like anything it has seen during training, and this can lead to unpredictable behaviour. We thus need to know whether any given input belongs to the population distribution of the training data to prevent unpredictable behaviour in deployed systems. A recent surge of interest on this problem has led to the development of sophisticated techniques in the deep learning literature. However, due to the absence of a standardized problem formulation or an exhaustive evaluation, it is not evident if we can rely on these methods in practice. What makes this problem different from a typical supervised learning setting is that we cannot model the diversity of out-of-distribution samples in practice. The distribution of outliers used in training may not be the same as the distribution of outliers encountered in the application. Therefore, classical approaches that learn inliers vs. outliers with only two datasets can yield optimistic results. We introduce OD-test, a three-dataset evaluation scheme as a practical and more reliable strategy to assess progress on this problem. The OD-test benchmark provides a straightforward means of comparison for methods that address the out-of-distribution sample detection problem. We present an exhaustive evaluation of a broad set of methods from related areas on image classification tasks. Furthermore, we show that for realistic applications of high-dimensional images, the existing methods have low accuracy. Our analysis reveals areas of strength and weakness of each method.
Linear Convergence of Gradient and Proximal-Gradient Methods Under the Polyak-\L{}ojasiewicz Condition
Karimi, Hamed, Nutini, Julie, Schmidt, Mark
In 1963, Polyak proposed a simple condition that is sufficient to show a global linear convergence rate for gradient descent. This condition is a special case of the \L{}ojasiewicz inequality proposed in the same year, and it does not require strong convexity (or even convexity). In this work, we show that this much-older Polyak-\L{}ojasiewicz (PL) inequality is actually weaker than the main conditions that have been explored to show linear convergence rates without strong convexity over the last 25 years. We also use the PL inequality to give new analyses of randomized and greedy coordinate descent methods, sign-based gradient descent methods, and stochastic gradient methods in the classic setting (with decreasing or constant step-sizes) as well as the variance-reduced setting. We further propose a generalization that applies to proximal-gradient methods for non-smooth optimization, leading to simple proofs of linear convergence of these methods. Along the way, we give simple convergence results for a wide variety of problems in machine learning: least squares, logistic regression, boosting, resilient backpropagation, L1-regularization, support vector machines, stochastic dual coordinate ascent, and stochastic variance-reduced gradient methods.
New Insights into Bootstrapping for Bandits
Vaswani, Sharan, Kveton, Branislav, Wen, Zheng, Rao, Anup, Schmidt, Mark, Abbasi-Yadkori, Yasin
We investigate the use of bootstrapping in the bandit setting. We first show that the commonly used non-parametric bootstrapping (NPB) procedure can be provably inefficient and establish a near-linear lower bound on the regret incurred by it under the bandit model with Bernoulli rewards. We show that NPB with an appropriate amount of forced exploration can result in sub-linear albeit sub-optimal regret. As an alternative to NPB, we propose a weighted bootstrapping (WB) procedure. For Bernoulli rewards, WB with multiplicative exponential weights is mathematically equivalent to Thompson sampling (TS) and results in near-optimal regret bounds. Similarly, in the bandit setting with Gaussian rewards, we show that WB with additive Gaussian weights achieves near-optimal regret. Beyond these special cases, we show that WB leads to better empirical performance than TS for several reward distributions bounded on $[0,1]$. For the contextual bandit setting, we give practical guidelines that make bootstrapping simple and efficient to implement and result in good empirical performance on real-world datasets.
Online Learning Rate Adaptation with Hypergradient Descent
Baydin, Atilim Gunes, Cornish, Robert, Rubio, David Martinez, Schmidt, Mark, Wood, Frank
We introduce a general method for improving the convergence rate of gradient-based optimizers that is easy to implement and works well in practice. We demonstrate the effectiveness of the method in a range of optimization problems by applying it to stochastic gradient descent, stochastic gradient descent with Nesterov momentum, and Adam, showing that it significantly reduces the need for the manual tuning of the initial learning rate for these commonly used algorithms. Our method works by dynamically updating the learning rate during optimization using the gradient with respect to the learning rate of the update rule itself. Computing this "hypergradient" needs little additional computation, requires only one extra copy of the original gradient to be stored in memory, and relies upon nothing more than what is provided by reverse-mode automatic differentiation.
Faster Stochastic Variational Inference using Proximal-Gradient Methods with General Divergence Functions
Khan, Mohammad Emtiyaz, Babanezhad, Reza, Lin, Wu, Schmidt, Mark, Sugiyama, Masashi
Several recent works have explored stochastic gradient methods for variational inference that exploit the geometry of the variational-parameter space. However, the theoretical properties of these methods are not well-understood and these methods typically only apply to conditionally-conjugate models. We present a new stochastic method for variational inference which exploits the geometry of the variational-parameter space and also yields simple closed-form updates even for non-conjugate models. We also give a convergence-rate analysis of our method and many other previous methods which exploit the geometry of the space. Our analysis generalizes existing convergence results for stochastic mirror-descent on non-convex objectives by using a more general class of divergence functions. Beyond giving a theoretical justification for a variety of recent methods, our experiments show that new algorithms derived in this framework lead to state of the art results on a variety of problems. Further, due to its generality, we expect that our theoretical analysis could also apply to other applications.