AITopics | Schmidt, Mark

Collaborating Authors

Schmidt, Mark

Information about AI from the News, Publications, and Conferences

Automatic Classification – Tagging and Summarization – Customizable Filtering and Analysis

If you are looking for an answer to the question What is Artificial Intelligence? and you only have a minute, then here's the definition the Association for the Advancement of Artificial Intelligence offers on its home page: "the scientific understanding of the mechanisms underlying thought and intelligent behavior and their embodiment in machines."

However, if you are fortunate enough to have more than a minute, then please get ready to embark upon an exciting journey exploring AI (but beware, it could last a lifetime) …

SLANG: Fast Structured Covariance Approximations for Bayesian Deep Learning with Natural Gradient

Mishkin, Aaron, Kunstner, Frederik, Nielsen, Didrik, Schmidt, Mark, Khan, Mohammad Emtiyaz

Neural Information Processing SystemsDec-31-2018

Uncertainty estimation in large deep-learning models is a computationally challenging task, where it is difficult to form even a Gaussian approximation to the posterior distribution. In such situations, existing methods usually resort to a diagonal approximation of the covariance matrix despite the fact that these matrices are known to result in poor uncertainty estimates. To address this issue, we propose a new stochastic, low-rank, approximate natural-gradient (SLANG) method for variational inference in large, deep models. Our method estimates a "diagonal plus low-rank" structure based solely on back-propagated gradients of the network log-likelihood. This requires strictly less gradient computations than methods that compute the gradient of the whole variational objective. Empirical evaluations on standard benchmarks confirm that SLANG enables faster and more accurate estimation of uncertainty than mean-field methods, and performs comparably to state-of-the-art methods.

approximation, deep learning, neural network, (18 more...)

Neural Information Processing Systems

Country: North America > Canada > British Columbia (0.14)

Genre: Research Report (0.51)

Industry: Health & Medicine (0.69)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Add feedback

SLANG: Fast Structured Covariance Approximations for Bayesian Deep Learning with Natural Gradient

Mishkin, Aaron, Kunstner, Frederik, Nielsen, Didrik, Schmidt, Mark, Khan, Mohammad Emtiyaz

Neural Information Processing SystemsDec-31-2018

Uncertainty estimation in large deep-learning models is a computationally challenging task, where it is difficult to form even a Gaussian approximation to the posterior distribution. In such situations, existing methods usually resort to a diagonal approximation of the covariance matrix despite the fact that these matrices are known to give poor uncertainty estimates. To address this issue, we propose a new stochastic, low-rank, approximate natural-gradient (SLANG) method for variational inference in large deep models. Our method estimates a “diagonal plus low-rank” structure based solely on back-propagated gradients of the network log-likelihood. This requires strictly less gradient computations than methods that compute the gradient of the whole variational objective. Empirical evaluations on standard benchmarks confirm that SLANG enables faster and more accurate estimation of uncertainty than mean-field methods, and performs comparably to state-of-the-art methods.

approximation, deep learning, neural network, (18 more...)

Neural Information Processing Systems

Country: North America > Canada > British Columbia (0.14)

Genre: Research Report (0.51)

Industry: Health & Medicine (0.69)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Add feedback

SLANG: Fast Structured Covariance Approximations for Bayesian Deep Learning with Natural Gradient

Mishkin, Aaron, Kunstner, Frederik, Nielsen, Didrik, Schmidt, Mark, Khan, Mohammad Emtiyaz

arXiv.org Artificial IntelligenceNov-11-2018

Uncertainty estimation in large deep-learning models is a computationally challenging task, where it is difficult to form even a Gaussian approximation to the posterior distribution. In such situations, existing methods usually resort to a diagonal approximation of the covariance matrix despite, the fact that these matrices are known to give poor uncertainty estimates. To address this issue, we propose a new stochastic, low-rank, approximate natural-gradient (SLANG) method for variational inference in large, deep models. Our method estimates a "diagonal plus low-rank" structure based solely on back-propagated gradients of the network log-likelihood. This requires strictly less gradient computations than methods that compute the gradient of the whole variational objective. Empirical evaluations on standard benchmarks confirm that SLANG enables faster and more accurate estimation of uncertainty than mean-field methods, and performs comparably to state-of-the-art methods.

approximation, deep learning, neural network, (17 more...)

arXiv.org Artificial Intelligence

1811.04504

Country: North America > Canada > British Columbia (0.14)

Genre: Research Report (1.00)

Industry: Health & Medicine (0.69)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Add feedback

Fast and Faster Convergence of SGD for Over-Parameterized Models and an Accelerated Perceptron

Vaswani, Sharan, Bach, Francis, Schmidt, Mark

arXiv.org Machine LearningOct-16-2018

Modern machine learning focuses on highly expressive models that are able to fit or interpolate the data completely, resulting in zero training loss. For such models, we show that the stochastic gradients of common loss functions satisfy a strong growth condition. Under this condition, we prove that constant step-size stochastic gradient descent (SGD) with Nesterov acceleration matches the convergence rate of the deterministic setting for both convex and strongly-convex functions. In the non-convex setting, this condition implies that SGD can find a first-order stationary point as efficiently as full gradient descent. Under interpolation, we also show that all smooth loss functions with a finite-sum structure satisfy a weaker growth condition. Given this weaker condition, we prove that SGD with a constant step-size attains the deterministic convergence rate in both the strongly-convex and convex settings. Under additional assumptions, the above results enable us to prove an O(1/k^2) mistake bound for $k$ iterations of a stochastic perceptron algorithm using the squared-hinge loss. Finally, we validate our theoretical findings with experiments on synthetic and real datasets.

artificial intelligence, growth condition, neural network, (18 more...)

arXiv.org Machine Learning

1810.07288

Country: Europe > Sweden (0.14)

Genre: Research Report (0.82)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.97)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Perceptrons (0.92)

Add feedback

Combining Bayesian Optimization and Lipschitz Optimization

Ahmed, Mohamed Osama, Vaswani, Sharan, Schmidt, Mark

arXiv.org Machine LearningOct-9-2018

Bayesian optimization and Lipschitz optimization have developed alternative techniques for optimizing black-box functions. They each exploit a different form of prior about the function. In this work, we explore strategies to combine these techniques for better global optimization. In particular, we propose ways to use the Lipschitz continuity assumption within traditional BO algorithms, which we call Lipschitz Bayesian optimization (LBO). This approach does not increase the asymptotic runtime and in some cases drastically improves the performance (while in the worst case the performance is similar). Indeed, in a particular setting, we prove that using the Lipschitz information yields the same or a better bound on the regret compared to using Bayesian optimization on its own. Moreover, we propose a simple heuristics to estimate the Lipschitz constant, and prove that a growing estimate of the Lipschitz constant is in some sense "harmless". Our experiments on 15 datasets with 4 acquisition functions show that in the worst case LBO performs similar to the underlying BO method while in some cases it performs substantially better. Thompson sampling in particular typically saw drastic improvements (as the Lipschitz information corrected for it's well-known "over-exploration" phenomenon) and its LBO variant often outperformed other acquisition functions.

acquisition function, artificial intelligence, optimization problem, (17 more...)

arXiv.org Machine Learning

1810.04336

Genre: Research Report > New Finding (0.68)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Optimization (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (1.00)

Add feedback

Does Your Model Know the Digit 6 Is Not a Cat? A Less Biased Evaluation of "Outlier" Detectors

Shafaei, Alireza, Schmidt, Mark, Little, James J.

arXiv.org Machine LearningSep-12-2018

In the real world, a learning system could receive an input that looks nothing like anything it has seen during training, and this can lead to unpredictable behaviour. We thus need to know whether any given input belongs to the population distribution of the training data to prevent unpredictable behaviour in deployed systems. A recent surge of interest on this problem has led to the development of sophisticated techniques in the deep learning literature. However, due to the absence of a standardized problem formulation or an exhaustive evaluation, it is not evident if we can rely on these methods in practice. What makes this problem different from a typical supervised learning setting is that we cannot model the diversity of out-of-distribution samples in practice. The distribution of outliers used in training may not be the same as the distribution of outliers encountered in the application. Therefore, classical approaches that learn inliers vs. outliers with only two datasets can yield optimistic results. We introduce OD-test, a three-dataset evaluation scheme as a practical and more reliable strategy to assess progress on this problem. The OD-test benchmark provides a straightforward means of comparison for methods that address the out-of-distribution sample detection problem. We present an exhaustive evaluation of a broad set of methods from related areas on image classification tasks. Furthermore, we show that for realistic applications of high-dimensional images, the existing methods have low accuracy. Our analysis reveals areas of strength and weakness of each method.

dataset, deep learning, neural network, (19 more...)

arXiv.org Machine Learning

1809.04729

Genre: Research Report (0.82)

Industry: Education (0.48)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Performance Analysis > Accuracy (0.93)

Add feedback

Linear Convergence of Gradient and Proximal-Gradient Methods Under the Polyak-\L{}ojasiewicz Condition

Karimi, Hamed, Nutini, Julie, Schmidt, Mark

arXiv.org Machine LearningJun-11-2018

In 1963, Polyak proposed a simple condition that is sufficient to show a global linear convergence rate for gradient descent. This condition is a special case of the \L{}ojasiewicz inequality proposed in the same year, and it does not require strong convexity (or even convexity). In this work, we show that this much-older Polyak-\L{}ojasiewicz (PL) inequality is actually weaker than the main conditions that have been explored to show linear convergence rates without strong convexity over the last 25 years. We also use the PL inequality to give new analyses of randomized and greedy coordinate descent methods, sign-based gradient descent methods, and stochastic gradient methods in the classic setting (with decreasing or constant step-sizes) as well as the variance-reduced setting. We further propose a generalization that applies to proximal-gradient methods for non-smooth optimization, leading to simple proofs of linear convergence of these methods. Along the way, we give simple convergence results for a wide variety of problems in machine learning: least squares, logistic regression, boosting, resilient backpropagation, L1-regularization, support vector machines, stochastic dual coordinate ascent, and stochastic variance-reduced gradient methods.

artificial intelligence, inequality, machine learning, (15 more...)

arXiv.org Machine Learning

1608.04636

Country: North America > Canada (0.28)

Genre: Research Report > New Finding (0.48)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.90)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Support Vector Machines (0.55)

Add feedback

New Insights into Bootstrapping for Bandits

Vaswani, Sharan, Kveton, Branislav, Wen, Zheng, Rao, Anup, Schmidt, Mark, Abbasi-Yadkori, Yasin

arXiv.org Machine LearningMay-24-2018

We investigate the use of bootstrapping in the bandit setting. We first show that the commonly used non-parametric bootstrapping (NPB) procedure can be provably inefficient and establish a near-linear lower bound on the regret incurred by it under the bandit model with Bernoulli rewards. We show that NPB with an appropriate amount of forced exploration can result in sub-linear albeit sub-optimal regret. As an alternative to NPB, we propose a weighted bootstrapping (WB) procedure. For Bernoulli rewards, WB with multiplicative exponential weights is mathematically equivalent to Thompson sampling (TS) and results in near-optimal regret bounds. Similarly, in the bandit setting with Gaussian rewards, we show that WB with additive Gaussian weights achieves near-optimal regret. Beyond these special cases, we show that WB leads to better empirical performance than TS for several reward distributions bounded on $[0,1]$. For the contextual bandit setting, we give practical guidelines that make bootstrapping simple and efficient to implement and result in good empirical performance on real-world datasets.

artificial intelligence, machine learning, procedure, (16 more...)

arXiv.org Machine Learning

1805.09793

Country: North America > United States > New York > New York County > New York City (0.14)

Genre: Research Report > New Finding (0.47)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Uncertainty (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Performance Analysis > Accuracy (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models > Directed Networks > Bayesian Learning (0.46)

Add feedback

Online Learning Rate Adaptation with Hypergradient Descent

Baydin, Atilim Gunes, Cornish, Robert, Rubio, David Martinez, Schmidt, Mark, Wood, Frank

arXiv.org Machine LearningFeb-25-2018

We introduce a general method for improving the convergence rate of gradient-based optimizers that is easy to implement and works well in practice. We demonstrate the effectiveness of the method in a range of optimization problems by applying it to stochastic gradient descent, stochastic gradient descent with Nesterov momentum, and Adam, showing that it significantly reduces the need for the manual tuning of the initial learning rate for these commonly used algorithms. Our method works by dynamically updating the learning rate during optimization using the gradient with respect to the learning rate of the update rule itself. Computing this "hypergradient" needs little additional computation, requires only one extra copy of the original gradient to be stored in memory, and relies upon nothing more than what is provided by reverse-mode automatic differentiation.

computer based training, educational technology, learning rate, (22 more...)

arXiv.org Machine Learning

1703.04782

Country:

Europe > United Kingdom > England (0.14)
North America > United States > California (0.14)
North America > Canada > Ontario > Toronto (0.14)

Genre:

Research Report (0.85)
Instructional Material (0.68)

Industry: Education > Educational Setting > Online (0.64)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Optimization (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (1.00)

Add feedback

Faster Stochastic Variational Inference using Proximal-Gradient Methods with General Divergence Functions

Khan, Mohammad Emtiyaz, Babanezhad, Reza, Lin, Wu, Schmidt, Mark, Sugiyama, Masashi

arXiv.org Machine LearningAug-11-2016

Several recent works have explored stochastic gradient methods for variational inference that exploit the geometry of the variational-parameter space. However, the theoretical properties of these methods are not well-understood and these methods typically only apply to conditionally-conjugate models. We present a new stochastic method for variational inference which exploits the geometry of the variational-parameter space and also yields simple closed-form updates even for non-conjugate models. We also give a convergence-rate analysis of our method and many other previous methods which exploit the geometry of the space. Our analysis generalizes existing convergence results for stochastic mirror-descent on non-convex objectives by using a more general class of divergence functions. Beyond giving a theoretical justification for a variety of recent methods, our experiments show that new algorithms derived in this framework lead to state of the art results on a variety of problems. Further, due to its generality, we expect that our theoretical analysis could also apply to other applications.

artificial intelligence, gradient, machine learning, (18 more...)

arXiv.org Machine Learning

1511.00146

Country:

Europe (0.46)
North America > Canada (0.28)

Genre: Research Report (0.64)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.49)

Add feedback