fitc
Understanding Probabilistic Sparse Gaussian Process Approximations
Matthias Bauer, Mark van der Wilk, Carl Edward Rasmussen
Good sparse approximations are essential for practical inference in Gaussian Processes as the computational cost of exact methods is prohibitive for large datasets. The Fully Independent Training Conditional (FITC) and the Variational Free Energy (VFE) approximations are two recent popular methods. Despite superficial similarities, these approximations have surprisingly different theoretical properties and behave differently in practice. We thoroughly investigate the two methods for regression both analytically and through illustrative examples, and draw conclusions to guide practical application.
Review for NeurIPS paper: Kernel Methods Through the Roof: Handling Billions of Points Efficiently
Weaknesses: In my opinion, comparing Nystrom for kernel ridge regression to variational GPs is apples to oranges in a lot of ways that are frankly unfair to variational GPs. In my view, a much more appropriate comparison would be a KeOps based implementation of SGPR or FITC with fixed inducing points. Variational GPs introduce a very large number of parameters in the form of the variational distribution and inducing point locations that require optimization and significantly increase the total amount of time spent in optimization. Methods that train GPs through the marginal likelihood with fixed inducing locations (e.g., as in Nystrom) may have as few as 3 parameters to fit. By contrast, SVGP learns (1) a variational distribution q(u) including a variational covariance matrix, and (2) the inducing point locations.
Understanding Probabilistic Sparse Gaussian Process Approximations Matthias Bauer โ โ Department of Engineering, University of Cambridge, Cambridge, UK
Good sparse approximations are essential for practical inference in Gaussian Processes as the computational cost of exact methods is prohibitive for large datasets. The Fully Independent Training Conditional (FITC) and the Variational Free Energy (VFE) approximations are two recent popular methods. Despite superficial similarities, these approximations have surprisingly different theoretical properties and behave differently in practice. We thoroughly investigate the two methods for regression both analytically and through illustrative examples, and draw conclusions to guide practical application.
Learning Gaussian Processes by Minimizing PAC-Bayesian Generalization Bounds
Reeb, David, Doerr, Andreas, Gerwinn, Sebastian, Rakitsch, Barbara
Gaussian Processes (GPs) are a generic modelling tool for supervised learning. While they have been successfully applied on large datasets, their use in safety-critical applications is hindered by the lack of good performance guarantees. To this end, we propose a method to learn GPs and their sparse approximations by directly optimizing a PAC-Bayesian bound on their generalization performance, instead of maximizing the marginal likelihood. Besides its theoretical appeal, we find in our evaluation that our learning method is robust and yields significantly better generalization guarantees than other common GP approaches on several regression benchmark datasets.
Understanding Probabilistic Sparse Gaussian Process Approximations
Bauer, Matthias, van der Wilk, Mark, Rasmussen, Carl Edward
Good sparse approximations are essential for practical inference in Gaussian Processes as the computational cost of exact methods is prohibitive for large datasets. The Fully Independent Training Conditional (FITC) and the Variational Free Energy (VFE) approximations are two recent popular methods. Despite superficial similarities, these approximations have surprisingly different theoretical properties and behave differently in practice. We thoroughly investigate the two methods for regression both analytically and through illustrative examples, and draw conclusions to guide practical application.
Understanding Probabilistic Sparse Gaussian Process Approximations
Bauer, Matthias, Wilk, Mark van der, Rasmussen, Carl Edward
Good sparse approximations are essential for practical inference in Gaussian Processes as the computational cost of exact methods is prohibitive for large datasets. The Fully Independent Training Conditional (FITC) and the Variational Free Energy (VFE) approximations are two recent popular methods. Despite superficial similarities, these approximations have surprisingly different theoretical properties and behave differently in practice. We thoroughly investigate the two methods for regression both analytically and through illustrative examples, and draw conclusions to guide practical application.
EigenGP: Gaussian Process Models with Adaptive Eigenfunctions
Peng, Hao (Purdue University) | Qi, Yuan (Purdue University)
Gaussian processes (GPs) provide a nonparametric representation of functions. However, classical GP inference suffers from high computational cost for big data. In this paper, we propose a new Bayesian approach, EigenGP, that learns both basis dictionary elements โ eigenfunctions of a GP prior โ and prior precisions in a sparse finite model. It is well known that, among all orthogonal basis functions, eigenfunctions can provide the most compact representation. Unlike other sparse Bayesian finite models where the basis function has a fixed form, our eigenfunctions live in a reproducing kernel Hilbert space as a finite linear combination of kernel functions. We learn the dictionary elements โ eigenfunctions โ and the prior precisions over these elements as well as all the other hyperparameters from data by maximizing the model marginal likelihood. We explore computational linear algebra to simplify the gradient computation significantly. Our experimental results demonstrate improved predictive performance of EigenGP over alternative sparse GP methods as well as relevance vector machines.
Kernel Interpolation for Scalable Structured Gaussian Processes (KISS-GP)
Wilson, Andrew Gordon, Nickisch, Hannes
We introduce a new structured kernel interpolation (SKI) framework, which generalises and unifies inducing point methods for scalable Gaussian processes (GPs). SKI methods produce kernel approximations for fast computations through kernel interpolation. The SKI framework clarifies how the quality of an inducing point approach depends on the number of inducing (aka interpolation) points, interpolation strategy, and GP covariance kernel. SKI also provides a mechanism to create new scalable kernel methods, through choosing different kernel interpolation strategies. Using SKI, with local cubic kernel interpolation, we introduce KISS-GP, which is 1) more scalable than inducing point alternatives, 2) naturally enables Kronecker and Toeplitz algebra for substantial additional gains in scalability, without requiring any grid data, and 3) can be used for fast and expressive kernel learning. KISS-GP costs O(n) time and storage for GP inference. We evaluate KISS-GP for kernel matrix approximation, kernel learning, and natural sound modelling.
Integrated Pre-Processing for Bayesian Nonlinear System Identification with Gaussian Processes
Frigola, Roger, Rasmussen, Carl Edward
We introduce GP-FNARX: a new model for nonlinear system identification based on a nonlinear autoregressive exogenous model (NARX) with filtered regressors (F) where the nonlinear regression problem is tackled using sparse Gaussian processes (GP). We integrate data pre-processing with system identification into a fully automated procedure that goes from raw data to an identified model. Both pre-processing parameters and GP hyper-parameters are tuned by maximizing the marginal likelihood of the probabilistic model. We obtain a Bayesian model of the system's dynamics which is able to report its uncertainty in regions where the data is scarce. The automated approach, the modeling of uncertainty and its relatively low computational cost make of GP-FNARX a good candidate for applications in robotics and adaptive control.
A Framework for Evaluating Approximation Methods for Gaussian Process Regression
Chalupka, Krzysztof, Williams, Christopher K. I., Murray, Iain
Gaussian process (GP) predictors are an important component of many Bayesian approaches to machine learning. However, even a straightforward implementation of Gaussian process regression (GPR) requires O(n^2) space and O(n^3) time for a dataset of n examples. Several approximation methods have been proposed, but there is a lack of understanding of the relative merits of the different approximations, and in what situations they are most useful. We recommend assessing the quality of the predictions obtained as a function of the compute time taken, and comparing to standard baselines (e.g., Subset of Data and FITC). We empirically investigate four different approximation algorithms on four different prediction problems, and make our code available to encourage future comparisons.