AITopics

2309.07364

Country: Europe > Switzerland > Zürich > Zürich (0.14)

Genre: Research Report > New Finding (0.86)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Inductive Learning (1.00)

arXiv.org Artificial Intelligence Jul-10-2023

Promises and Pitfalls of the Linearized Laplace in Bayesian Optimization

Kristiadi, Agustinus, Immer, Alexander, Eschenhagen, Runa, Fortuin, Vincent

The linearized-Laplace approximation (LLA) has been shown to be effective and efficient in constructing Bayesian neural networks. It is theoretically compelling since it can be seen as a Gaussian process posterior with the mean function given by the neural network's maximum-a-posteriori predictive function and the covariance function induced by the empirical neural tangent kernel. However, while its efficacy has been studied in large-scale tasks like image classification, it has not been studied in sequential decision-making problems like Bayesian optimization where Gaussian processes -- with simple mean functions and kernels such as the radial basis function -- are the de-facto surrogate models. In this work, we study the usefulness of the LLA in Bayesian optimization and highlight its strong performance and flexibility. However, we also present some pitfalls that might arise and a potential problem with the LLA when the search space is unbounded.

artificial intelligence, bayesian inference, machine learning

2304.08309

Country:

Europe (0.68)
North America > Canada (0.28)

Genre: Research Report (1.00)

arXiv.org Artificial Intelligence Jun-6-2023

Stochastic Marginal Likelihood Gradients using Neural Tangent Kernels

Immer, Alexander, van der Ouderaa, Tycho F. A., van der Wilk, Mark, Rätsch, Gunnar, Schölkopf, Bernhard

Selecting hyperparameters in deep learning greatly impacts its effectiveness but requires manual effort and expertise. Recent works show that Bayesian model selection with Laplace approximations can be used to optimize such hyperparameters just like standard neural network parameters, using gradients and on the training data. However, estimating a single hyperparameter gradient requires a pass through the entire dataset, limiting the scalability of such algorithms. In this work, we overcome this issue by introducing lower bounds to the linearized Laplace approximation of the marginal likelihood. In contrast to previous estimators, these bounds are amenable to stochastic-gradient-based optimization and allow trading off estimation accuracy against computational complexity. We derive them using the function-space form of the linearized Laplace, which can be estimated using the neural tangent kernel. Experimentally, we show that the estimators can significantly accelerate gradient-based hyperparameter optimization.

approximation, artificial intelligence, machine learning

2306.03968

Country:

North America > United States > Hawaii (0.14)
Europe > United Kingdom > England (0.14)
Europe > Switzerland > Zürich > Zürich (0.14)

Genre: Research Report (0.50)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.88)
Information Technology > Artificial Intelligence > Representation & Reasoning > Uncertainty > Bayesian Inference (0.88)
Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models > Directed Networks > Bayesian Learning (0.66)

arXiv.org Artificial Intelligence Jun-1-2023

On the Identifiability and Estimation of Causal Location-Scale Noise Models

Immer, Alexander, Schultheiss, Christoph, Vogt, Julia E., Schölkopf, Bernhard, Bühlmann, Peter, Marx, Alexander

We study the class of location-scale or heteroscedastic noise models (LSNMs), in which the effect $Y$ can be written as a function of the cause $X$ and a noise source $N$ independent of $X$, which may be scaled by a positive function $g$ over the cause, i.e., $Y = f(X) + g(X)N$. Despite the generality of the model class, we show the causal direction is identifiable up to some pathological cases. To empirically validate these theoretical findings, we propose two estimators for LSNMs: an estimator based on (non-linear) feature maps, and one based on neural networks. Both model the conditional distribution of $Y$ given $X$ as a Gaussian parameterized by its natural parameters. When the feature maps are correctly specified, we prove that our estimator is jointly concave, and a consistent estimator for the cause-effect identification task. Although the neural network does not inherit those guarantees, it can fit functions of arbitrary complexity, and reaches state-of-the-art performance across benchmarks.
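
As a rough illustration of the model class described in this abstract, the sketch below samples data from $Y = f(X) + g(X)N$ with NumPy. It is not the paper's estimator. The helper name `sample_lsnm`, the choices of `f` and `g`, and the uniform distribution over the cause are all assumptions made for illustration only.

```python
import numpy as np

def sample_lsnm(n, f, g, rng):
    """Draw n pairs (x, y) with y = f(x) + g(x) * noise, noise independent of x."""
    x = rng.uniform(-2.0, 2.0, size=n)   # cause X (distribution is an assumption)
    noise = rng.standard_normal(n)       # noise source N, drawn independently of X
    y = f(x) + g(x) * noise              # effect Y
    return x, y

rng = np.random.default_rng(0)
f = np.tanh                              # hypothetical location function
g = lambda x: 0.1 + 0.5 * x**2           # hypothetical scale function, positive everywhere
x, y = sample_lsnm(20000, f, g, rng)

# Heteroscedasticity check: the residual spread should grow where g(x) is larger.
residual = y - f(x)
spread_center = residual[np.abs(x) < 0.5].std()
spread_edge = residual[np.abs(x) > 1.5].std()
print(f"residual std near 0: {spread_center:.3f}, near the edges: {spread_edge:.3f}")
```

With a constant `g`, the same code reduces to an ordinary additive noise model, which is the special case that LSNMs generalize.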

artificial intelligence, estimator, machine learning

2210.09054

Country:

Europe > Switzerland (0.28)
North America > United States (0.28)

Genre: Research Report (1.00)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (0.93)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.87)
Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models (0.69)

arXiv.org Machine Learning Oct-15-2021

Pathologies in priors and inference for Bayesian transformers

Cinquin, Tristan, Immer, Alexander, Horn, Max, Fortuin, Vincent

In recent years, the transformer has established itself as a workhorse in many applications ranging from natural language processing to reinforcement learning. Similarly, Bayesian deep learning has become the gold-standard for uncertainty estimation in safety-critical applications, where robustness and calibration are crucial. Surprisingly, no successful attempts to improve transformer models in terms of predictive uncertainty using Bayesian inference exist. In this work, we study this curiously underpopulated area of Bayesian transformers. We find that weight-space inference in transformers does not work well, regardless of the approximate posterior. We also find that the prior is at least partially at fault, but that it is very hard to find well-specified weight priors for these models. We hypothesize that these problems stem from the complexity of obtaining a meaningful mapping from weight-space to function-space distributions in the transformer. Therefore, moving closer to function-space, we propose a novel method based on the implicit reparameterization of the Dirichlet distribution to apply variational inference directly to the attention weights. We find that this proposed method performs competitively with our baselines.

diagnostic medicine, machine learning, natural language

2110.0402

Country: Europe > Switzerland > Zürich > Zürich (0.15)

Genre: Research Report (0.84)

Industry: Health & Medicine > Diagnostic Medicine (0.42)

Technology:

Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (0.94)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.67)

arXiv.org Machine Learning Jun-28-2021

Laplace Redux -- Effortless Bayesian Deep Learning

Daxberger, Erik, Kristiadi, Agustinus, Immer, Alexander, Eschenhagen, Runa, Bauer, Matthias, Hennig, Philipp

Bayesian formulations of deep learning have been shown to have compelling theoretical properties and offer practical functional benefits, such as improved predictive uncertainty quantification and model selection. The Laplace approximation (LA) is a classic, and arguably the simplest family of approximations for the intractable posteriors of deep neural networks. Yet, despite its simplicity, the LA is not as popular as alternatives like variational Bayes or deep ensembles. This may be due to assumptions that the LA is expensive due to the involved Hessian computation, that it is difficult to implement, or that it yields inferior results. In this work we show that these are misconceptions: we (i) review the range of variants of the LA including versions with minimal cost overhead; (ii) introduce "laplace", an easy-to-use software library for PyTorch offering user-friendly access to all major flavors of the LA; and (iii) demonstrate through extensive experiments that the LA is competitive with more popular alternatives in terms of performance, while excelling in terms of computational cost. We hope that this work will serve as a catalyst to a wider adoption of the LA in practical deep learning, including in domains where Bayesian approaches are not typically considered at the moment.

artificial intelligence, machine learning, neural network

2106.14806

Genre: Research Report (0.40)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

arXiv.org Machine Learning May-11-2021

Scalable Marginal Likelihood Estimation for Model Selection in Deep Learning

Immer, Alexander, Bauer, Matthias, Fortuin, Vincent, Rätsch, Gunnar, Khan, Mohammad Emtiyaz

Marginal-likelihood based model-selection, even though promising, is rarely used in deep learning due to estimation difficulties. Instead, most approaches rely on validation data, which may not be readily available. In this work, we present a scalable marginal-likelihood estimation method to select both the hyperparameters and network architecture based on the training data alone. Some hyperparameters can be estimated online during training, simplifying the procedure. Our marginal-likelihood estimate is based on Laplace's method and Gauss-Newton approximations to the Hessian, and it outperforms cross-validation and manual-tuning on standard regression and image classification datasets, especially in terms of calibration and out-of-distribution detection. Our work shows that marginal likelihoods can improve generalization and be useful when validation data is unavailable (e.g., in nonstationary settings).

approximation, deep learning, neural network

2104.04975

Country:

North America > United States > Arizona (0.14)
North America > Canada > Ontario > Toronto (0.14)
Europe > Switzerland > Zürich > Zürich (0.14)

Genre: Research Report (0.50)

arXiv.org Machine Learning Oct-22-2020

Improving predictions of Bayesian neural networks via local linearization

Immer, Alexander, Korzepa, Maciej, Bauer, Matthias

The generalized Gauss-Newton (GGN) approximation is often used to make practical Bayesian deep learning approaches scalable by replacing a second order derivative with a product of first order derivatives. In this paper we argue that the GGN approximation should be understood as a local linearization of the underlying Bayesian neural network (BNN), which turns the BNN into a generalized linear model (GLM). Because we use this linearized model for posterior inference, we should also predict using this modified model instead of the original one. We refer to this modified predictive as "GLM predictive" and show that it effectively resolves common underfitting problems of the Laplace approximation. It extends previous results in this vein to general likelihoods and has an equivalent Gaussian process formulation, which enables alternative inference schemes for BNNs in function space. We demonstrate the effectiveness of our approach on several standard classification datasets and on out-of-distribution detection.
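
The local-linearization view in this abstract can be written out in a few lines. The notation below ($\theta_*$ for the MAP estimate, $J$ for the Jacobian of the network outputs with respect to the parameters) is an assumption chosen for illustration and does not come from the listing itself.

```latex
% Linearized network around the MAP estimate \theta_*
f_{\mathrm{lin}}(x; \theta) = f(x; \theta_*) + J_*(x)\,(\theta - \theta_*),
\qquad
J_*(x) = \left.\frac{\partial f(x; \theta)}{\partial \theta}\right|_{\theta = \theta_*}

% Exact Hessian of the log-likelihood, by the chain rule
\nabla_\theta^2 \log p\big(y \mid f(x; \theta)\big)
  = J(x)^\top \Big[\nabla_f^2 \log p(y \mid f)\Big] J(x)
  + \sum_k \frac{\partial \log p(y \mid f)}{\partial f_k}\,\nabla_\theta^2 f_k(x; \theta)

% GGN approximation: drop the term containing second derivatives of the network
\nabla_\theta^2 \log p\big(y \mid f(x; \theta)\big)
  \approx J(x)^\top \Big[\nabla_f^2 \log p(y \mid f)\Big] J(x)
```

For $f_{\mathrm{lin}}$ the dropped term vanishes exactly, because $f_{\mathrm{lin}}$ is linear in $\theta$. This is one way to read the abstract's argument that a posterior obtained with the GGN should be paired with predictions from the linearized model.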

approximation, deep learning, neural network

2008.084

Country:

North America > United States (0.28)
Europe > United Kingdom > England (0.14)
Europe > Switzerland > Zürich > Zürich (0.14)

Genre: Research Report (0.63)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

arXiv.org Machine Learning Jul-21-2020

Disentangling the Gauss-Newton Method and Approximate Inference for Neural Networks

Immer, Alexander

In this thesis, we disentangle the generalized Gauss-Newton and approximate inference for Bayesian deep learning. The generalized Gauss-Newton method is an optimization method that is used in several popular Bayesian deep learning algorithms. Algorithms that combine the Gauss-Newton method with the Laplace and Gaussian variational approximation have recently led to state-of-the-art results in Bayesian deep learning. While the Laplace and Gaussian variational approximation have been studied extensively, their interplay with the Gauss-Newton method remains unclear. Recent criticism of priors and posterior approximations in Bayesian deep learning further urges the need for a deeper understanding of practical algorithms. The individual analysis of the Gauss-Newton method and Laplace and Gaussian variational approximations for neural networks provides both theoretical insight and new practical algorithms. We find that the Gauss-Newton method simplifies the underlying probabilistic model significantly. In particular, the combination of the Gauss-Newton method with approximate inference can be cast as inference in a linear or Gaussian process model. The Laplace and Gaussian variational approximation can subsequently provide a posterior approximation to these simplified models. This new disentangled understanding of recent Bayesian deep learning algorithms also leads to new methods: first, the connection to Gaussian processes enables new function-space inference algorithms. Second, we present a marginal likelihood approximation of the underlying probabilistic model to tune neural network hyperparameters. Finally, the identified underlying models lead to different methods to compute predictive distributions. In fact, we find that these prediction methods for Bayesian neural networks often work better than the default choice and solve a common issue with the Laplace approximation.

approximation, deep learning, neural network

2007.11994

Country: North America > Canada (0.28)

Genre: Research Report > Experimental Study (0.45)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Uncertainty (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models > Directed Networks > Bayesian Learning (1.00)

arXiv.org Machine Learning Jun-26-2020

Continual Deep Learning by Functional Regularisation of Memorable Past

Pan, Pingbo, Swaroop, Siddharth, Immer, Alexander, Eschenhagen, Runa, Turner, Richard E., Khan, Mohammad Emtiyaz

Continually learning new skills is important for intelligent systems, yet standard deep learning methods suffer from catastrophic forgetting of the past. Recent works address this with weight regularisation. Functional regularisation, although computationally expensive, is expected to perform better, but rarely does so in practice. In this paper, we fix this issue by using a new functional-regularisation approach that utilises a few memorable past examples crucial to avoid forgetting. By using a Gaussian Process formulation of deep networks, our approach enables training in weight-space while identifying both the memorable past and a functional prior. Our method achieves state-of-the-art performance on standard benchmarks and opens a new direction for life-long learning where regularisation and memory-based methods are naturally combined.

approximation, deep learning, neural network

2004.1407

Country:

Europe (0.92)
North America > United States (0.46)

Genre: Research Report (1.00)

Industry: Education > Educational Setting > Continuing Education (0.34)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)