Williams, Christopher
Score-Optimal Diffusion Schedules
Williams, Christopher, Campbell, Andrew, Doucet, Arnaud, Syed, Saifuddin
Denoising diffusion models (DDMs) generate a path of probability distributions interpolating between a reference Gaussian distribution and a data distribution by incrementally injecting noise into the data. To numerically simulate the sampling process, a discretisation schedule from the reference back towards clean data must be chosen. An appropriate discretisation schedule is crucial to obtain high-quality samples. However, beyond hand-crafted heuristics, a general method for choosing this schedule remains elusive. This paper presents a novel algorithm for adaptively selecting an optimal discretisation schedule with respect to a cost that we derive. Our cost measures the work done by the simulation procedure to transport samples from one point in the diffusion path to the next. Our method does not require hyperparameter tuning and adapts to the dynamics and geometry of the diffusion path. Our algorithm only involves the evaluation of the estimated Stein score, making it scalable to existing pre-trained models at inference time and online during training. We find that our learned schedule recovers performant schedules previously only discovered through manual search and obtains competitive FID scores on image datasets.
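To illustrate the kind of procedure the abstract describes, here is a minimal sketch of equal-cost schedule placement driven only by evaluations of an estimated score. The cost functional (mean squared score norm), the function names, and the grid sizes are placeholder assumptions for illustration, not the cost derived in the paper.

```python
import numpy as np

def adaptive_schedule(score_fn, x_samples, n_steps, t_min=1e-3, t_max=1.0, n_grid=200):
    """Sketch of adaptive discretisation for a diffusion sampler.

    Assumptions (not from the paper): score_fn(x, t) returns the estimated
    Stein score for a batch x at time t, and the mean squared score norm is
    used as a stand-in for the per-time transport cost.
    """
    ts = np.linspace(t_min, t_max, n_grid)
    # Local cost density estimated from score evaluations on a coarse grid.
    cost = np.array([np.mean(np.sum(score_fn(x_samples, t) ** 2, axis=-1)) for t in ts])
    # Cumulative cost along the path, normalised to [0, 1].
    cum = np.concatenate([[0.0], np.cumsum(0.5 * (cost[1:] + cost[:-1]) * np.diff(ts))])
    cum /= cum[-1]
    # Place time points so that every step performs (approximately) equal work.
    targets = np.linspace(0.0, 1.0, n_steps + 1)
    return np.interp(targets, cum, ts)
```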
Approximations to the Fisher Information Metric of Deep Generative Models for Out-Of-Distribution Detection
Dauncey, Sam, Holmes, Chris, Williams, Christopher, Falck, Fabian
Likelihood-based deep generative models such as score-based diffusion models and variational autoencoders are state-of-the-art machine learning models approximating high-dimensional distributions of data such as images, text, or audio. One of many downstream tasks they can be naturally applied to is out-of-distribution (OOD) detection. However, seminal work by Nalisnick et al., which we reproduce, showed that deep generative models consistently infer higher log-likelihoods for OOD data than for the data they were trained on, marking an open problem. In this work, we analyse using the gradient of a data point with respect to the parameters of the deep generative model for OOD detection, based on the simple intuition that OOD data should have larger gradient norms than training data. We formalise measuring the size of the gradient as approximating the Fisher information metric. We show that the Fisher information matrix (FIM) has large absolute diagonal values, motivating the use of chi-square-distributed, layer-wise gradient norms as features. We combine these features to make a simple, model-agnostic and hyperparameter-free method for OOD detection which estimates the joint density of the layer-wise gradient norms for a given data point. We find that these layer-wise gradient norms are weakly correlated, rendering their combined usage informative, and prove that the layer-wise gradient norms satisfy the principle of (data representation) invariance. Our empirical results indicate that this method outperforms the Typicality test for most deep generative models and image dataset pairings.
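A minimal sketch of the gradient-norm idea: compute per-layer squared gradient norms of the model's negative log-likelihood, then fit a simple density over those features on training data and score new points with it. The per-parameter-tensor layer grouping, the diagonal-Gaussian density, and all names here are simplifying assumptions rather than the paper's exact construction.

```python
import torch

def layerwise_grad_norms(model, nll_fn, x):
    """Per-layer squared gradient norms of the negative log-likelihood.

    Assumptions: nll_fn(model, x) returns a scalar negative log-likelihood for
    one input x; 'layer' here means one entry per parameter tensor, a
    simplification of the paper's grouping.
    """
    model.zero_grad()
    nll_fn(model, x).backward()
    return torch.stack([p.grad.pow(2).sum()
                        for p in model.parameters() if p.grad is not None])

def fit_gradient_norm_score(train_features, eps=1e-6):
    """Fit a diagonal Gaussian to log gradient-norm features gathered on
    training data; return a callable giving an (unnormalised) log-density
    OOD score for new feature vectors."""
    logf = train_features.clamp_min(eps).log()
    mu, sigma = logf.mean(dim=0), logf.std(dim=0) + eps
    def score(features):
        z = (features.clamp_min(eps).log() - mu) / sigma
        return -0.5 * (z ** 2).sum()
    return score
```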
A Unified Framework for U-Net Design and Analysis
Williams, Christopher, Falck, Fabian, Deligiannidis, George, Holmes, Chris, Doucet, Arnaud, Syed, Saifuddin
U-Nets are a go-to, state-of-the-art neural architecture across numerous tasks for continuous signals on a square, such as images and Partial Differential Equations (PDEs); however, their design and architecture are understudied. In this paper, we provide a framework for designing and analysing general U-Net architectures. We present theoretical results which characterise the role of the encoder and decoder in a U-Net, their high-resolution scaling limits and their conjugacy to ResNets via preconditioning. We propose Multi-ResNets, U-Nets with a simplified, wavelet-based encoder without learnable parameters. Further, we show how to design novel U-Net architectures which encode function constraints, natural bases, or the geometry of the data. In diffusion models, our framework enables us to identify that high-frequency information is dominated by noise exponentially faster, and show how U-Nets with average pooling exploit this. In our experiments, we demonstrate how Multi-ResNets achieve competitive and often superior performance compared to classical U-Nets in image segmentation, PDE surrogate modelling, and generative modelling with diffusion models. Our U-Net framework paves the way to study the theoretical properties of U-Nets and design natural, scalable neural architectures for a multitude of problems beyond the square.
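As a rough illustration of a Multi-ResNet-style design, the sketch below pairs a parameter-free, wavelet-like encoder (average pooling to coarser resolutions) with a small learnable decoder. Channel counts, depth, the single input channel, and the use of plain convolutions are illustrative assumptions, not the architecture used in the paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class HaarEncoder(nn.Module):
    """Parameter-free encoder: average pooling projects onto coarser scales."""
    def forward(self, x):
        # Return the input together with two coarser resolutions.
        return [x, F.avg_pool2d(x, 2), F.avg_pool2d(x, 4)]

class TinyMultiResNet(nn.Module):
    """Minimal sketch: non-learnable multi-resolution encoder, learnable decoder.
    Assumes single-channel inputs with side lengths divisible by 4."""
    def __init__(self, ch=16):
        super().__init__()
        self.encode = HaarEncoder()
        self.coarse = nn.Conv2d(1, ch, 3, padding=1)
        self.mid = nn.Conv2d(ch + 1, ch, 3, padding=1)
        self.fine = nn.Conv2d(ch + 1, 1, 3, padding=1)

    def forward(self, x):
        fine, mid, coarse = self.encode(x)
        h = F.relu(self.coarse(coarse))
        h = F.interpolate(h, scale_factor=2)
        h = F.relu(self.mid(torch.cat([h, mid], dim=1)))
        h = F.interpolate(h, scale_factor=2)
        return self.fine(torch.cat([h, fine], dim=1))
```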
A Multi-Resolution Framework for U-Nets with Applications to Hierarchical VAEs
Falck, Fabian, Williams, Christopher, Danks, Dominic, Deligiannidis, George, Yau, Christopher, Holmes, Chris, Doucet, Arnaud, Willetts, Matthew
U-Net architectures are ubiquitous in state-of-the-art deep learning; however, their regularisation properties and relationship to wavelets are understudied. In this paper, we formulate a multi-resolution framework which identifies U-Nets as finite-dimensional truncations of models on an infinite-dimensional function space. We provide theoretical results which prove that average pooling corresponds to projection within the space of square-integrable functions and show that U-Nets with average pooling implicitly learn a Haar wavelet basis representation of the data. We then leverage our framework to identify state-of-the-art hierarchical VAEs (HVAEs), which have a U-Net architecture, as a type of two-step forward Euler discretisation of multi-resolution diffusion processes which flow from a point mass, introducing sampling instabilities. We also demonstrate that HVAEs learn a representation of time which allows for improved parameter efficiency through weight-sharing. We use this observation to achieve state-of-the-art HVAE performance with half the number of parameters of existing models, exploiting the properties of our continuous-time formulation.
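A small numerical sanity check of the pooling/wavelet correspondence mentioned above: 2x average pooling of a 1-D signal agrees with the coarse Haar scaling coefficients up to a constant factor. This toy check is ours, not an experiment from the paper.

```python
import numpy as np

# 2x average pooling of a 1-D signal vs. Haar approximation (scaling) coefficients.
x = np.random.randn(8)
avg_pool = x.reshape(-1, 2).mean(axis=1)
haar_scaling = (x[0::2] + x[1::2]) / np.sqrt(2.0)   # coarse Haar coefficients
# Average pooling is the Haar projection up to the 1/sqrt(2) normalisation.
assert np.allclose(avg_pool, haar_scaling / np.sqrt(2.0))
```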
A Generative Model for Parts-based Object Segmentation
Eslami, S., Williams, Christopher
The Shape Boltzmann Machine (SBM) has recently been introduced as a state-of-the-art model of foreground/background object shape. We extend the SBM to account for the foreground object's parts. Our model, the Multinomial SBM (MSBM), can capture both local and global statistics of part shapes accurately. We combine the MSBM with an appearance model to form a fully generative model of images of objects. Parts-based image segmentations are obtained simply by performing probabilistic inference in the model. We apply the model to two challenging datasets which exhibit significant shape and appearance variability, and find that it obtains results that are comparable to the state-of-the-art.
Multi-task Gaussian Process Learning of Robot Inverse Dynamics
Williams, Christopher, Klanke, Stefan, Vijayakumar, Sethu, Chai, Kian M.
The inverse dynamics problem for a robotic manipulator is to compute the torques needed at the joints to drive it along a given trajectory; it is beneficial to be able to learn this function for adaptive control. A given robot manipulator will often need to be controlled while holding different loads in its end effector, giving rise to a multi-task learning problem. We show how the structure of the inverse dynamics problem gives rise to a multi-task Gaussian process prior over functions, where the inter-task similarity depends on the underlying dynamic parameters. Experiments demonstrate that this multi-task formulation generally improves performance over either learning only on single tasks or pooling the data over all tasks.
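A hedged sketch of a multi-task GP covariance of the kind described: the covariance between task i at input x and task j at input x' factorises into an inter-task term and a shared input kernel. Building the task covariance from per-load parameter vectors is shown only as one plausible choice; the RBF kernel, names, and lengthscale are assumptions.

```python
import numpy as np

def rbf(X1, X2, lengthscale=1.0):
    """Shared squared-exponential input kernel (an illustrative choice)."""
    d2 = ((X1[:, None, :] - X2[None, :, :]) ** 2).sum(-1)
    return np.exp(-0.5 * d2 / lengthscale ** 2)

def multitask_kernel(X1, tasks1, X2, tasks2, task_cov, lengthscale=1.0):
    """Multi-task covariance K((x, i), (x', j)) = Kf[i, j] * k_x(x, x').

    Assumption: task_cov (Kf) encodes inter-task similarity; for inverse
    dynamics it could, for example, be built as Kf = P @ P.T from a matrix P
    whose rows are per-load dynamic-parameter vectors.
    """
    return task_cov[np.ix_(tasks1, tasks2)] * rbf(X1, X2, lengthscale)
```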
Multi-task Gaussian Process Prediction
Bonilla, Edwin V., Chai, Kian M., Williams, Christopher
In this paper we investigate multi-task learning in the context of Gaussian Processes (GP). We propose a model that learns a shared covariance function on input-dependent features and a "free-form" covariance matrix over tasks. This allows for good flexibility when modelling inter-task dependencies while avoiding the need for large amounts of data for training. We show that under the assumption of noise-free observations and a block design, predictions for a given task only depend on its target values and therefore a cancellation of inter-task transfer occurs. We evaluate the benefits of our model on two practical applications: a compiler performance prediction problem and an exam score prediction task. Additionally, we make use of GP approximations and properties of our model in order to provide scalability to large data sets.
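A minimal sketch of prediction under this kind of model, using the Kronecker structure of the covariance (free-form task matrix Kf times shared input kernel Kx). The interface and the naive dense solve are illustrative assumptions and ignore the approximations the paper uses for scalability.

```python
import numpy as np

def multitask_gp_predict(Kx, Kf, Y, noise, kx_star, task):
    """Predictive mean for one task at one test input under a block design.

    Model assumption (following the abstract): cov(y_i(x), y_j(x')) =
    Kf[i, j] * Kx(x, x') plus i.i.d. observation noise. Y is (n_points,
    n_tasks); kx_star holds Kx(x*, X) for the test input x*.
    """
    n, m = Y.shape
    K = np.kron(Kf, Kx) + noise * np.eye(n * m)   # full (nm x nm) covariance
    k_star = np.kron(Kf[task], kx_star)           # cross-covariance for (x*, task)
    alpha = np.linalg.solve(K, Y.T.reshape(-1))   # targets stacked task-major
    return k_star @ alpha
```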
Factorial Switching Kalman Filters for Condition Monitoring in Neonatal Intensive Care
Williams, Christopher, Quinn, John, McIntosh, Neil
The observed physiological dynamics of an infant receiving intensive care are affected by many possible factors, including interventions to the baby, the operation of the monitoring equipment and the state of health. The Factorial Switching Kalman Filter can be used to infer the presence of such factors from a sequence of observations, and to estimate the true values where these observations have been corrupted. We apply this model to clinical time series data and show it to be effective in identifying a number of artifactual and physiological patterns.
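A simplified sketch of one filtering step in a switching Kalman filter, collapsing the per-mode posteriors back to a single Gaussian. The factorial structure over several factors is flattened into a single switch variable here, and the mode parameters and names are placeholders, not the paper's model.

```python
import numpy as np

def switching_kf_step(mu, P, y, modes, prior_w):
    """One collapsed filtering step of a switching Kalman filter.

    Simplifying assumptions: each entry of `modes` is a tuple (A, Q, C, R) of
    dynamics/observation matrices for one discrete regime (e.g. a probe
    dropout or a physiological event) and `prior_w` are prior mode weights.
    """
    means, covs, logw = [], [], []
    for (A, Q, C, R), pw in zip(modes, prior_w):
        m_pred, P_pred = A @ mu, A @ P @ A.T + Q              # predict
        S = C @ P_pred @ C.T + R                              # innovation covariance
        K = P_pred @ C.T @ np.linalg.inv(S)                   # Kalman gain
        r = y - C @ m_pred                                    # innovation
        means.append(m_pred + K @ r)
        covs.append((np.eye(len(mu)) - K @ C) @ P_pred)
        logw.append(np.log(pw) - 0.5 * (r @ np.linalg.solve(S, r)
                                        + np.log(np.linalg.det(S))))
    w = np.exp(np.array(logw) - max(logw))
    w /= w.sum()
    # Collapse the mode-conditional posteriors to one Gaussian (moment matching).
    mu_new = sum(wk * mk for wk, mk in zip(w, means))
    P_new = sum(wk * (Pk + np.outer(mk - mu_new, mk - mu_new))
                for wk, Pk, mk in zip(w, covs, means))
    return mu_new, P_new, w
```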
Using the Equivalent Kernel to Understand Gaussian Process Regression
Sollich, Peter, Williams, Christopher
The equivalent kernel [1] is a way of understanding how Gaussian process regression works for large sample sizes based on a continuum limit. In this paper we show (1) how to approximate the equivalent kernel of the widely-used squared exponential (or Gaussian) kernel and related kernels, and (2) how analysis using the equivalent kernel helps to understand the learning curves for Gaussian processes.
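For reference, the continuum-limit expression commonly used in this setting (notation ours, not copied from the paper): the equivalent kernel acts as a low-pass filter whose Fourier transform is the prior power spectrum divided by the spectrum plus a noise-to-sample-density term.

```latex
% Equivalent kernel in the Fourier domain (continuum limit; notation ours):
% S_f is the power spectrum of the GP covariance function, \sigma_n^2 the noise
% variance, and \rho the density of input samples.
\tilde{h}(s) = \frac{S_f(s)}{S_f(s) + \sigma_n^2/\rho},
\qquad
S_f(s) = (2\pi\ell^2)^{D/2} \exp\!\left(-2\pi^2 \ell^2 \lVert s\rVert^2\right)
\quad \text{for the squared-exponential kernel } k(r) = \exp\!\left(-\lVert r\rVert^2/(2\ell^2)\right).
```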