
Collaborating Authors

 van der Wilk, Mark


Tighter Bounds on the Log Marginal Likelihood of Gaussian Process Regression Using Conjugate Gradients

arXiv.org Machine Learning

We propose a lower bound on the log marginal likelihood of Gaussian process regression models that can be computed without matrix factorisation of the full kernel matrix. We show that approximate maximum likelihood learning of model parameters by maximising our lower bound retains many of the benefits of the sparse variational approach while reducing the bias introduced into parameter learning. The basis of our bound is a more careful analysis of the log-determinant term appearing in the log marginal likelihood, as well as using the method of conjugate gradients to derive tight lower bounds on the term involving a quadratic form. Our approach is a step forward in unifying methods relying on lower bound maximisation (e.g. variational methods) and iterative approaches based on conjugate gradients for training Gaussian processes. In experiments, we show improved predictive performance with our model in comparable training time to other conjugate-gradient-based approaches.
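
As a rough illustration of the quadratic-form part of this idea (a toy sketch, not the paper's implementation; the kernel, data, and number of CG steps below are arbitrary assumptions): for any vector $v$, $2v^\top y - v^\top K_\sigma v$ lower-bounds $y^\top K_\sigma^{-1} y$, and since $K_\sigma \succeq \sigma^2 I$, the conjugate-gradient residual $r = y - K_\sigma v$ also gives an upper bound $2v^\top y - v^\top K_\sigma v + \|r\|^2/\sigma^2$. An upper bound on this quadratic form in turn yields a lower bound on the log marginal likelihood.

```python
# Toy sketch (not the paper's code): bounding the quadratic term y^T K_sigma^{-1} y
# of the GP log marginal likelihood with a few conjugate-gradient iterations.
# Kernel, data and hyperparameters are made up for illustration.
import numpy as np

def rbf_kernel(X, lengthscale=1.0, variance=1.0):
    sq = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    return variance * np.exp(-0.5 * sq / lengthscale ** 2)

def conjugate_gradients(A, b, num_iters):
    """Plain CG for A x = b; returns the iterate after num_iters steps."""
    x = np.zeros_like(b)
    r = b - A @ x
    p = r.copy()
    rs = r @ r
    for _ in range(num_iters):
        Ap = A @ p
        alpha = rs / (p @ Ap)
        x = x + alpha * p
        r = r - alpha * Ap
        rs_new = r @ r
        p = r + (rs_new / rs) * p
        rs = rs_new
    return x

rng = np.random.default_rng(0)
N, noise_var = 200, 0.1
X = rng.normal(size=(N, 2))
y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=N)
K_sigma = rbf_kernel(X) + noise_var * np.eye(N)

v = conjugate_gradients(K_sigma, y, num_iters=10)   # approximate K_sigma^{-1} y
r = y - K_sigma @ v                                  # CG residual
Q = 2 * v @ y - v @ K_sigma @ v                      # lower bound on y^T K_sigma^{-1} y
upper = Q + (r @ r) / noise_var                      # upper bound, since K_sigma >= noise_var * I
exact = y @ np.linalg.solve(K_sigma, y)
print(f"lower {Q:.3f} <= exact {exact:.3f} <= upper {upper:.3f}")
```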


Bayesian Neural Network Priors Revisited

arXiv.org Machine Learning

In a Bayesian neural network (BNN), we specify a prior p(w) over the neural network parameters, and compute the posterior distribution over parameters conditioned on training data, p(w | x, y) = p(y | w, x) p(w) / p(y | x). This procedure should give considerable advantages for reasoning about predictive uncertainty, which is especially relevant in the small-data setting. Crucially, to perform Bayesian inference, we need to choose a prior that accurately reflects our beliefs about the parameters before seeing any data (Bayes, 1763; Gelman et al., 2013). However, the most common choice of the prior for BNN weights is the simplest one: the isotropic Gaussian. Isotropic Gaussians are used across almost all fields of Bayesian deep learning, ranging from variational inference (Blundell et al., 2015; Dusenberry et al., 2020), to sampling-based inference (Zhang et al., 2019), and even to infinite networks (Lee et al., 2017; Garriga-Alonso et al., 2019). This is troubling, since isotropic Gaussian priors are almost certainly not the best choice. Indeed, despite the progress on more accurate and efficient inference procedures, in most settings, the posterior predictive of BNNs using a Gaussian prior still leads to worse predictive performance than a baseline obtained by training the network with standard stochastic gradient descent (SGD) (e.g., Zhang et al., 2019; Heek & Kalchbrenner, 2019; Wenzel et al., 2020a). However, it has been shown that the performance of BNNs can be improved by artificially reducing posterior uncertainty using "cold posteriors" (Wenzel et al., 2020a).
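
For concreteness, here is a minimal sketch of the quantities in question, with a made-up toy regression problem, architecture, and scales (not the paper's setup): the unnormalised log posterior over weights is simply the Gaussian log likelihood plus the isotropic Gaussian log prior discussed above.

```python
# Illustrative sketch only: unnormalised log posterior log p(y | w, x) + log p(w)
# for a tiny MLP with an isotropic Gaussian prior. Data, architecture and scales
# are assumptions made for this example.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 1))
y = np.sin(3 * X[:, 0]) + 0.1 * rng.normal(size=50)
hidden = 20

def mlp(w, X):
    """Unpack a flat weight vector into a one-hidden-layer tanh MLP and evaluate it."""
    W1 = w[:hidden].reshape(1, hidden)
    b1 = w[hidden:2 * hidden]
    W2 = w[2 * hidden:3 * hidden].reshape(hidden, 1)
    b2 = w[3 * hidden]
    return (np.tanh(X @ W1 + b1) @ W2)[:, 0] + b2

def log_joint(w, prior_std=1.0, noise_std=0.1):
    log_prior = -0.5 * np.sum(w ** 2) / prior_std ** 2    # isotropic Gaussian prior p(w)
    resid = y - mlp(w, X)
    log_lik = -0.5 * np.sum(resid ** 2) / noise_std ** 2  # Gaussian likelihood p(y | w, x)
    return log_lik + log_prior                            # log p(w | x, y) up to a constant

w = rng.normal(size=3 * hidden + 1)
print(log_joint(w))
```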


Correlated Weights in Infinite Limits of Deep Convolutional Neural Networks

arXiv.org Machine Learning

Infinite-width limits of deep neural networks often have tractable forms. They have been used to analyse the behaviour of finite networks, as well as being useful methods in their own right. When investigating infinitely wide CNNs, it was observed that the correlations arising from spatial weight sharing disappear in the infinite limit. This is undesirable, as spatial correlation is the main motivation behind CNNs. We show that the loss of this property is not a consequence of the infinite limit, but rather of choosing an independent weight prior. Correlating the weights maintains the correlations in the activations. Varying the amount of correlation interpolates between independent-weight limits and mean-pooling. Empirical evaluation of the infinitely wide network shows that optimal performance is achieved between the extremes, indicating that correlations can be useful.
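
A toy numerical illustration of the interpolation claim (not the paper's construction; sizes, features, and the specific covariance are assumptions): giving per-position readout weights the covariance $(1-\rho)I + \rho\mathbf{1}\mathbf{1}^\top$ recovers independent weights at $\rho = 0$ and behaves like a single shared weight applied to mean-pooled features at $\rho = 1$.

```python
# Toy illustration (not the paper's construction): spatially correlated readout
# weights interpolating between independent weights (rho = 0) and behaviour
# equivalent to mean-pooling (rho = 1). Sizes and features are assumptions.
import numpy as np

rng = np.random.default_rng(0)
P, C = 16, 8                          # spatial positions, channels
features = rng.normal(size=(P, C))    # fixed activations at each position

def sample_output(rho):
    """One draw of a linear readout whose per-position weights share correlation rho."""
    cov = (1 - rho) * np.eye(P) + rho * np.ones((P, P))
    L = np.linalg.cholesky(cov + 1e-9 * np.eye(P))
    w = L @ rng.normal(size=(P, C))   # correlated across positions, independent across channels
    return np.sum(w * features) / P

for rho in [0.0, 0.5, 1.0]:
    samples = np.array([sample_output(rho) for _ in range(5000)])
    print(rho, samples.var())

# rho = 1 behaves like a single shared weight vector applied to mean-pooled features:
pooled = features.mean(axis=0)
shared = np.array([rng.normal(size=C) @ pooled for _ in range(5000)])
print("mean-pool", shared.var())
```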


Design of Experiments for Verifying Biomolecular Networks

arXiv.org Machine Learning

There is a growing trend in molecular and synthetic biology of using mechanistic (non-machine-learning) models to design biomolecular networks. Once designed, these networks need to be validated by experimental results to ensure the theoretical network correctly models the true system. However, these experiments can be expensive and time consuming. We propose a design-of-experiments approach for validating these networks efficiently. Gaussian processes are used to construct a probabilistic model of the discrepancy between experimental results and the designed response, and a Bayesian optimization strategy is then used to select the next sample points. We compare different design criteria and develop a stopping criterion based on a metric that quantifies this discrepancy over the whole surface, and its uncertainty. We test our strategy on simulated data from computer models of biochemical processes.
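
A minimal sketch of the loop described above, under assumed toy data and a single simple design criterion (maximum predictive variance of the discrepancy); the paper compares several criteria, so this is only illustrative.

```python
# Illustrative sketch (assumptions throughout): fit a GP to the discrepancy
# between designed and observed responses, then pick the next experiment where
# predictive uncertainty about the discrepancy is largest.
import numpy as np

def rbf(A, B, ls=0.2):
    d2 = np.sum((A[:, None, :] - B[None, :, :]) ** 2, axis=-1)
    return np.exp(-0.5 * d2 / ls ** 2)

rng = np.random.default_rng(1)
X_obs = rng.uniform(size=(6, 1))                   # experiments run so far
discrepancy = 0.3 * np.sin(6 * X_obs[:, 0])        # observed minus designed response (made up)

X_cand = np.linspace(0, 1, 200)[:, None]           # candidate next experiments
noise = 1e-4
K = rbf(X_obs, X_obs) + noise * np.eye(len(X_obs))
K_s = rbf(X_cand, X_obs)
mean = K_s @ np.linalg.solve(K, discrepancy)
var = 1.0 - np.sum(K_s * np.linalg.solve(K, K_s.T).T, axis=1)

next_x = X_cand[np.argmax(var)]                    # uncertainty-sampling acquisition
print("next experiment at", next_x)
```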


Understanding Variational Inference in Function-Space

arXiv.org Machine Learning

Recent work has attempted to directly approximate the `function-space' or predictive posterior distribution of Bayesian models, without approximating the posterior distribution over the parameters. This is appealing in e.g. Bayesian neural networks, where we only need the former, and the latter is hard to represent. In this work, we highlight some advantages and limitations of employing the Kullback-Leibler divergence in this setting. For example, we show that minimizing the KL divergence between a wide class of parametric distributions and the posterior induced by a (non-degenerate) Gaussian process prior leads to an ill-defined objective function. Then, we propose (featurized) Bayesian linear regression as a benchmark for `function-space' inference methods that directly measures approximation quality. We apply this methodology to assess aspects of the objective function and inference scheme considered in Sun, Zhang, Shi, and Grosse (2018), emphasizing the quality of approximation to Bayesian inference as opposed to predictive performance.
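
A minimal sketch of the proposed benchmark setting, with assumed features, data, and noise levels: featurized Bayesian linear regression has a closed-form posterior and predictive distribution, so any approximate `function-space' posterior can be compared against it directly.

```python
# Sketch of the assumed benchmark setup: exact posterior and predictive
# distribution for featurized Bayesian linear regression. Features and scales
# are illustrative choices, not the paper's exact configuration.
import numpy as np

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(40, 1))
y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=40)

def features(X, centres=np.linspace(-3, 3, 15), width=0.5):
    return np.exp(-0.5 * (X - centres[None, :]) ** 2 / width ** 2)

Phi = features(X)                      # N x D feature matrix
noise_var, prior_var = 0.01, 1.0

# Posterior over weights is Gaussian N(mu, Sigma) in closed form.
A = Phi.T @ Phi / noise_var + np.eye(Phi.shape[1]) / prior_var
Sigma = np.linalg.inv(A)
mu = Sigma @ Phi.T @ y / noise_var

# Exact predictive mean/variance at test inputs; approximate 'function-space'
# posteriors can be measured against these quantities directly.
X_test = np.linspace(-3, 3, 5)[:, None]
Phi_test = features(X_test)
pred_mean = Phi_test @ mu
pred_var = np.sum(Phi_test @ Sigma * Phi_test, axis=1) + noise_var
print(pred_mean, pred_var)
```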


Convergence of Sparse Variational Inference in Gaussian Processes Regression

arXiv.org Machine Learning

Gaussian processes are distributions over functions that are versatile and mathematically convenient priors in Bayesian modelling. However, their use is often impeded for data with large numbers of observations, $N$, due to the cubic (in $N$) cost of matrix operations used in exact inference. Many solutions have been proposed that rely on $M \ll N$ inducing variables to form an approximation at a cost of $\mathcal{O}(NM^2)$. While the computational cost appears linear in $N$, the true complexity depends on how $M$ must scale with $N$ to ensure a certain quality of the approximation. In this work, we investigate upper and lower bounds on how $M$ needs to grow with $N$ to ensure high quality approximations. We show that we can make the KL divergence between the approximate model and the exact posterior arbitrarily small for a Gaussian-noise regression model with $M\ll N$. Specifically, for the popular squared exponential kernel and $D$-dimensional Gaussian distributed covariates, $M=\mathcal{O}((\log N)^D)$ suffices, and a method with an overall computational cost of $\mathcal{O}(N(\log N)^{2D}(\log\log N)^2)$ can be used to perform inference.
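
For reference, a minimal sketch of the standard collapsed sparse variational bound whose approximation quality this analysis concerns, computed using only $M \times M$ factorisations so the cost is $\mathcal{O}(NM^2)$; the kernel, data, inducing-point initialisation, and the constant in $M \propto (\log N)^D$ are illustrative assumptions.

```python
# Minimal sketch of the standard collapsed sparse variational (Titsias-style)
# bound, evaluated with only M x M factorisations. Data, kernel and the
# inducing-point choice below are illustrative assumptions.
import numpy as np

def rbf(A, B, ls=1.0):
    d2 = np.sum((A[:, None, :] - B[None, :, :]) ** 2, axis=-1)
    return np.exp(-0.5 * d2 / ls ** 2)

rng = np.random.default_rng(0)
N, D, noise = 1000, 1, 0.1
X = rng.normal(size=(N, D))
y = np.sin(2 * X[:, 0]) + np.sqrt(noise) * rng.normal(size=N)

M = int(3 * np.log(N) ** D)                 # M = O((log N)^D); constant factor is arbitrary
Z = X[rng.choice(N, M, replace=False)]      # simple inducing-point initialisation

Kmm = rbf(Z, Z) + 1e-6 * np.eye(M)
Kmn = rbf(Z, X)
L = np.linalg.cholesky(Kmm)
A = np.linalg.solve(L, Kmn) / np.sqrt(noise)        # M x N
B = np.eye(M) + A @ A.T                             # M x M
LB = np.linalg.cholesky(B)
c = np.linalg.solve(LB, A @ y) / np.sqrt(noise)

logdet = N * np.log(noise) + 2 * np.sum(np.log(np.diag(LB)))   # log|Qnn + noise*I|
quad = (y @ y) / noise - c @ c                                 # y^T (Qnn + noise*I)^{-1} y
trace = (N - noise * np.sum(A ** 2)) / noise                   # tr(Knn - Qnn)/noise; diag(Knn)=1
elbo = -0.5 * (N * np.log(2 * np.pi) + logdet + quad + trace)
print(M, elbo)
```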


Variational Orthogonal Features

arXiv.org Machine Learning

Sparse stochastic variational inference allows Gaussian process models to be applied to large datasets. The per-iteration computational cost of inference with this method is $\mathcal{O}(\tilde{N}M^2+M^3),$ where $\tilde{N}$ is the number of points in a minibatch and $M$ is the number of `inducing features', which determine the expressiveness of the variational family. Several recent works have shown that for certain priors, features can be defined that remove the $\mathcal{O}(M^3)$ cost of computing a minibatch estimate of an evidence lower bound (ELBO). This represents a significant computational saving when $M\gg \tilde{N}$. We present a construction of features for any stationary prior kernel that allows for computation of an unbiased estimator of the ELBO using $T$ Monte Carlo samples in $\mathcal{O}(\tilde{N}T+M^2T)$, and in $\mathcal{O}(\tilde{N}T+MT)$ with an additional approximation. We analyze the impact of this additional approximation on inference quality.
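
As background for the kind of Monte Carlo feature estimators referred to here (this is not the paper's construction): by Bochner's theorem, any stationary kernel is an expectation over its spectral density, so $T$ sampled frequencies give an unbiased estimate of the kernel. The sketch below shows this for an RBF kernel with unit lengthscale.

```python
# Background sketch (not the paper's features): a T-sample Monte Carlo estimate
# of a stationary kernel via its spectral density, unbiased in T.
import numpy as np

rng = np.random.default_rng(0)
D, T = 2, 2000
x, x2 = rng.normal(size=D), rng.normal(size=D)

# Bochner's theorem: k(x, x') = E_{w ~ N(0, I)}[cos(w^T (x - x'))] for the unit-lengthscale RBF kernel.
w = rng.normal(size=(T, D))                       # T Monte Carlo frequency samples
estimate = np.mean(np.cos(w @ (x - x2)))          # unbiased estimator of k(x, x')
exact = np.exp(-0.5 * np.sum((x - x2) ** 2))
print(estimate, exact)
```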


Revisiting the Train Loss: an Efficient Performance Estimator for Neural Architecture Search

arXiv.org Machine Learning

Reliable yet efficient evaluation of generalisation performance of a proposed architecture is crucial to the success of neural architecture search (NAS). Traditional approaches face a variety of limitations: training each architecture to completion is prohibitively expensive, early stopping estimates may correlate poorly with fully trained performance, and model-based estimators require large training sets. Instead, motivated by recent results linking training speed and generalisation with stochastic gradient descent, we propose to estimate the final test performance based on the sum of training losses. Our estimator is inspired by the marginal likelihood, which is used for Bayesian model selection. Our model-free estimator is simple, efficient, and cheap to implement, and does not require hyperparameter tuning or surrogate training before deployment. We demonstrate empirically that our estimator consistently outperforms other baselines and can achieve a rank correlation of 0.95 with final test accuracy on the NAS-Bench-201 dataset within 50 epochs.
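
A minimal sketch of the estimator and evaluation protocol, using synthetic loss curves rather than NAS-Bench-201 data (the curve shapes and accuracy model below are assumptions): score each architecture by the sum of its early training losses, then report the Spearman rank correlation with final test accuracy.

```python
# Sketch of the sum-of-training-losses estimator on synthetic data
# (placeholder loss curves, not NAS-Bench-201 results).
import numpy as np
from scipy.stats import spearmanr

rng = np.random.default_rng(0)
num_archs, epochs = 20, 50
# Fake per-epoch training losses: "better" architectures decay faster.
quality = rng.uniform(0.5, 2.0, size=num_archs)
losses = np.exp(-quality[:, None] * np.linspace(0, 1, epochs)) + 0.05 * rng.random((num_archs, epochs))
final_test_acc = 0.9 - 0.1 / quality + 0.01 * rng.normal(size=num_archs)

score = -losses.sum(axis=1)          # lower summed training loss -> higher score
rho, _ = spearmanr(score, final_test_acc)
print("rank correlation:", rho)
```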


Scalable Bayesian dynamic covariance modeling with variational Wishart and inverse Wishart processes

arXiv.org Machine Learning

We implement gradient-based variational inference routines for Wishart and inverse Wishart processes, which we apply as Bayesian models for the dynamic, heteroskedastic covariance matrix of a multivariate time series. The Wishart and inverse Wishart processes are constructed from i.i.d. Gaussian processes, for which we apply existing black-box variational inference algorithms for approximate Gaussian process inference. These methods scale well with the length of the time series; however, they fail in the case of the Wishart process, an issue we resolve with a simple modification of the model into an additive white noise parameterization. This modification is also key to implementing a factored variant of the construction, allowing inference to additionally scale to high-dimensional covariance matrices. As with existing MCMC-based inference routines for the Wishart and inverse Wishart processes, we show that these variational alternatives significantly outperform multivariate GARCH baselines when forecasting the covariances of returns on financial instruments.
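
For context, a minimal sketch of the underlying construction (illustrative kernel, degrees of freedom, and scale matrix; not the paper's inference code): a Wishart process is obtained by summing outer products of i.i.d. Gaussian process draws, yielding a positive-definite covariance matrix at every time point.

```python
# Sketch of a Wishart process built from i.i.d. Gaussian process draws,
# giving a dynamic covariance matrix Sigma(t). Kernel, degrees of freedom
# and scale matrix are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)
T_len, D, nu = 100, 3, 5                  # time steps, dimension, degrees of freedom (nu >= D)
t = np.linspace(0, 10, T_len)

def rbf(t, ls=1.0):
    return np.exp(-0.5 * (t[:, None] - t[None, :]) ** 2 / ls ** 2)

K = rbf(t) + 1e-6 * np.eye(T_len)
Lk = np.linalg.cholesky(K)
# nu * D i.i.d. GP draws over the time grid, shape (T_len, nu, D)
F = np.einsum('ts,snd->tnd', Lk, rng.normal(size=(T_len, nu, D)))

A = np.array([[1.0, 0.0, 0.0], [0.3, 1.0, 0.0], [0.2, 0.1, 1.0]])   # scale factor, Sigma_0 = A A^T
# Sigma(t) = sum_k A f_k(t) f_k(t)^T A^T, positive definite for nu >= D
Sigma = np.einsum('de,tne,tnf,gf->tdg', A, F, F, A)
print(Sigma.shape, np.linalg.eigvalsh(Sigma[0]))
```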


Overcoming Mean-Field Approximations in Recurrent Gaussian Process Models

arXiv.org Machine Learning

We identify a new variational inference scheme for dynamical systems whose transition function is modelled by a Gaussian process. Inference in this setting has either employed computationally intensive MCMC methods, or relied on factorisations of the variational posterior. As we demonstrate in our experiments, the factorisation between latent system states and transition function can lead to a miscalibrated posterior and to learning unnecessarily large noise terms. We eliminate this factorisation by explicitly modelling the dependence between state trajectories and the Gaussian process posterior. Samples of the latent states can then be tractably generated by conditioning on this representation. The method we obtain (VCDT: variationally coupled dynamics and trajectories) gives better predictive performance and more calibrated estimates of the transition function, yet maintains the same time and space complexities as mean-field methods. Code is available at: github.com/ialong/GPt.
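
A toy sketch of the kind of coupled sampling described above (not VCDT itself; the 1-D state, kernel, and noise levels are assumptions): latent states are rolled out by conditioning each new transition-function value on those generated earlier, so the whole trajectory is consistent with a single sample of the transition function.

```python
# Toy sketch (not VCDT): rolling out latent states under a GP transition function
# by conditioning each new function value on the values generated so far.
# State dimension, kernel and noise levels are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)

def k(a, b, ls=1.0):
    return np.exp(-0.5 * (np.asarray(a)[:, None] - np.asarray(b)[None, :]) ** 2 / ls ** 2)

process_noise, T_steps = 0.01, 30
xs, f_in, f_out = [0.5], [], []          # visited states, and f evaluated at them
for _ in range(T_steps):
    x = xs[-1]
    if f_in:
        K = k(f_in, f_in) + 1e-6 * np.eye(len(f_in))
        ks = k([x], f_in)
        mean = (ks @ np.linalg.solve(K, np.array(f_out)))[0]
        var = 1.0 - (ks @ np.linalg.solve(K, ks.T))[0, 0]
    else:
        mean, var = 0.0, 1.0
    f_x = mean + np.sqrt(max(var, 0.0)) * rng.normal()   # sample f(x) | previously sampled values
    f_in.append(x)
    f_out.append(f_x)
    xs.append(f_x + np.sqrt(process_noise) * rng.normal())
print(np.round(xs, 2))
```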