Snoek, Jasper
Bayesian Optimization via Continual Variational Last Layer Training
Brunzema, Paul, Jordahn, Mikkel, Willes, John, Trimpe, Sebastian, Snoek, Jasper, Harrison, James
Gaussian Processes (GPs) are widely seen as the state-of-the-art surrogate models for Bayesian optimization (BO) due to their ability to model uncertainty, their performance on tasks where correlations are easily captured (such as those defined by Euclidean metrics), and their ability to be efficiently updated online. However, the performance of GPs depends on the choice of kernel, and kernel selection for complex correlation structures is often difficult or must be made bespoke. While Bayesian neural networks (BNNs) are a promising direction for higher capacity surrogate models, they have so far seen limited use due to poor performance on some problem types. In this paper, we propose an approach which shows competitive performance on many problem types, including some that BNNs typically struggle with. We build on variational Bayesian last layers (VBLLs), and connect training of these models to exact conditioning in GPs. We exploit this connection to develop an efficient online training algorithm that interleaves conditioning and optimization. Our findings suggest that VBLL networks significantly outperform GPs and other BNN architectures on tasks with complex input correlations, and match the performance of well-tuned GPs on established benchmark tasks.
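To make the idea of exact conditioning concrete, here is a minimal sketch of conjugate Bayesian linear regression on fixed features, where the Gaussian posterior over last-layer weights is updated one observation at a time. This is not the paper's VBLL training algorithm. The random features, the noise variance, and the names `condition` and `batch_posterior` are assumptions chosen for illustration.

```python
import numpy as np

def condition(mean, precision, phi, y, noise_var):
    """Condition the Gaussian N(mean, precision^-1) over weights on one observation (phi, y)."""
    new_precision = precision + np.outer(phi, phi) / noise_var
    rhs = precision @ mean + phi * y / noise_var
    new_mean = np.linalg.solve(new_precision, rhs)
    return new_mean, new_precision

def batch_posterior(Phi, y, prior_precision, noise_var):
    """Posterior for a zero-mean Gaussian prior, computed from all data at once."""
    precision = prior_precision + Phi.T @ Phi / noise_var
    mean = np.linalg.solve(precision, Phi.T @ y / noise_var)
    return mean, precision

# Hypothetical data: 20 observations of 4 fixed (here random) features.
rng = np.random.default_rng(0)
Phi = rng.normal(size=(20, 4))
w_true = np.array([1.0, -2.0, 0.5, 0.0])
noise_var = 0.01
y = Phi @ w_true + rng.normal(scale=np.sqrt(noise_var), size=20)

# Online conditioning, starting from a standard normal prior over the weights.
prior_precision = np.eye(4)
mean, precision = np.zeros(4), prior_precision.copy()
for phi_i, y_i in zip(Phi, y):
    mean, precision = condition(mean, precision, phi_i, y_i, noise_var)

batch_mean, batch_precision = batch_posterior(Phi, y, prior_precision, noise_var)
```

In this conjugate setting the sequential and batch posteriors coincide, which is the property that makes cheap online updates possible when the features are held fixed.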
Variational Bayesian Last Layers
Harrison, James, Willes, John, Snoek, Jasper
We introduce a deterministic variational formulation for training Bayesian last layer neural networks. This yields a sampling-free, single-pass model and loss that effectively improves uncertainty estimation. Our variational Bayesian last layer (VBLL) can be trained and evaluated with only quadratic complexity in last layer width, and is thus (nearly) computationally free to add to standard architectures. We experimentally investigate VBLLs, and show that they improve predictive accuracy, calibration, and out of distribution detection over baselines across both regression and classification. Finally, we investigate combining VBLL layers with variational Bayesian feature learning, yielding a lower variance collapsed variational inference method for Bayesian neural networks. Well-calibrated uncertainty quantification is essential for reliable decision-making with machine learning systems.
Beyond Human Data: Scaling Self-Training for Problem-Solving with Language Models
Singh, Avi, Co-Reyes, John D., Agarwal, Rishabh, Anand, Ankesh, Patil, Piyush, Garcia, Xavier, Liu, Peter J., Harrison, James, Lee, Jaehoon, Xu, Kelvin, Parisi, Aaron, Kumar, Abhishek, Alemi, Alex, Rizkowsky, Alex, Nova, Azade, Adlam, Ben, Bohnet, Bernd, Elsayed, Gamaleldin, Sedghi, Hanie, Mordatch, Igor, Simpson, Isabelle, Gur, Izzeddin, Snoek, Jasper, Pennington, Jeffrey, Hron, Jiri, Kenealy, Kathleen, Swersky, Kevin, Mahajan, Kshiteej, Culp, Laura, Xiao, Lechao, Bileschi, Maxwell L., Constant, Noah, Novak, Roman, Liu, Rosanne, Warkentin, Tris, Qian, Yundi, Bansal, Yamini, Dyer, Ethan, Neyshabur, Behnam, Sohl-Dickstein, Jascha, Fiedel, Noah
Fine-tuning language models (LMs) on human-generated data remains a prevalent practice. However, the performance of such models is often limited by the quantity and diversity of high-quality human data. In this paper, we explore whether we can go beyond human data on tasks where we have access to scalar feedback, for example, on math problems where one can verify correctness. To do so, we investigate a simple self-training method based on expectation-maximization, which we call ReST$^{EM}$, where we (1) generate samples from the model and filter them using binary feedback, (2) fine-tune the model on these samples, and (3) repeat this process a few times. Testing on advanced MATH reasoning and APPS coding benchmarks using PaLM-2 models, we find that ReST$^{EM}$ scales favorably with model size and significantly surpasses fine-tuning only on human data. Overall, our findings suggest self-training with feedback can substantially reduce dependence on human-generated data.
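The generate, filter, fine-tune loop described in the abstract can be sketched on a toy problem. Here the "model" is only a categorical distribution over four candidate answers to a single question, and "fine-tuning" is a smoothed maximum-likelihood refit on the samples that pass the binary check. The function name `rest_em_toy`, the smoothing constant, and the sample counts are assumptions for illustration. This is not the paper's implementation.

```python
import numpy as np

def rest_em_toy(probs, is_correct, n_samples=200, n_iters=3, smoothing=1.0, seed=0):
    """Toy self-training loop: sample answers, keep the correct ones, refit, repeat."""
    rng = np.random.default_rng(seed)
    probs = np.asarray(probs, dtype=float)
    is_correct = np.asarray(is_correct, dtype=bool)
    history = [probs.copy()]
    for _ in range(n_iters):
        # (1) Generate samples from the current model and filter with binary feedback.
        samples = rng.choice(len(probs), size=n_samples, p=probs)
        kept = samples[is_correct[samples]]
        if kept.size > 0:
            # (2) "Fine-tune": smoothed maximum-likelihood refit on the kept samples.
            counts = np.bincount(kept, minlength=len(probs)) + smoothing
            probs = counts / counts.sum()
        # (3) Repeat with the updated model.
        history.append(probs.copy())
    return history

# Hypothetical setup: the model initially prefers a wrong answer (index 0).
initial_probs = [0.7, 0.1, 0.1, 0.1]
is_correct = [False, True, False, False]
history = rest_em_toy(initial_probs, is_correct)
```

Each entry of `history` is the answer distribution after one iteration, so the probability assigned to the verified answer can be tracked across rounds.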
Kernel Regression with Infinite-Width Neural Networks on Millions of Examples
Adlam, Ben, Lee, Jaehoon, Padhy, Shreyas, Nado, Zachary, Snoek, Jasper
Kernel methods are often contrasted with deep learning, but recent advances in machine learning have identified and developed exciting correspondences [Lee et al., 2018, Matthews et al., 2018, Jacot et al., 2018]. While a useful method in its own right, kernel regression has been used to better understand neural networks and deep learning. More specifically, if the parameters of a neural network are treated as random variables whose distribution is set by the initialization, we can view the neural network as a random function. Then as the width of the network becomes large, the distribution of this random function is a Gaussian process with a specific covariance function or kernel. We refer to kernels that arise from this connection with infinite-width neural networks as neural kernels. The specific kernel is determined by the architecture, inference type, and other hyperparameters of the neural network. Moreover, the connection between neural networks and Gaussian processes has generated many high-performance kernels for diverse or nonstandard data modalities, such as images, sequences, and graphs. This performance often comes at a cost, as the kernels require significantly more compute than standard kernels such as RBFs.
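As a small illustration of a neural kernel, the sketch below computes the Gaussian-process kernel of an infinitely wide fully connected ReLU network using the standard arc-cosine recursion, then uses it for kernel ridge regression. The depth, the weight and bias variances, the ridge term, and the function names are assumptions for illustration. The paper's large-scale kernels and solvers are not reproduced here.

```python
import numpy as np

def relu_nngp_kernel(X, depth=3, weight_var=2.0, bias_var=0.0):
    """Kernel matrix of an infinitely wide ReLU network with `depth` hidden layers."""
    K = weight_var * (X @ X.T) / X.shape[1] + bias_var
    for _ in range(depth):
        std = np.sqrt(np.diag(K))
        norm = np.outer(std, std)
        cos_theta = np.clip(K / norm, -1.0, 1.0)
        theta = np.arccos(cos_theta)
        # Closed form of E[relu(u) relu(v)] for jointly Gaussian (u, v).
        expected = norm * (np.sin(theta) + (np.pi - theta) * cos_theta) / (2 * np.pi)
        K = weight_var * expected + bias_var
    return K

def kernel_regression(X_train, y_train, X_test, ridge=1e-6, **kernel_kwargs):
    """Kernel ridge regression predictions at X_test."""
    n = X_train.shape[0]
    K = relu_nngp_kernel(np.vstack([X_train, X_test]), **kernel_kwargs)
    K_train, K_cross = K[:n, :n], K[n:, :n]
    alpha = np.linalg.solve(K_train + ridge * np.eye(n), y_train)
    return K_cross @ alpha

# Hypothetical data for a quick check.
rng = np.random.default_rng(0)
X_train = rng.normal(size=(10, 5))
y_train = np.sin(X_train[:, 0])
X_test = rng.normal(size=(3, 5))
K = relu_nngp_kernel(X_train)
predictions = kernel_regression(X_train, y_train, X_test)
```

With `weight_var=2.0` and `bias_var=0.0` the recursion leaves the diagonal of the kernel unchanged from layer to layer. The dense solve scales cubically in the number of training points, which is why scaling such kernels to millions of examples needs more specialized methods.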
A Simple Approach to Improve Single-Model Deep Uncertainty via Distance-Awareness
Liu, Jeremiah Zhe, Padhy, Shreyas, Ren, Jie, Lin, Zi, Wen, Yeming, Jerfel, Ghassen, Nado, Zack, Snoek, Jasper, Tran, Dustin, Lakshminarayanan, Balaji
Accurate uncertainty quantification is a major challenge in deep learning, as neural networks can make overconfident errors and assign high confidence predictions to out-of-distribution (OOD) inputs. The most popular approaches to estimate predictive uncertainty in deep learning are methods that combine predictions from multiple neural networks, such as Bayesian neural networks (BNNs) and deep ensembles. However, their practicality in real-time, industrial-scale applications is limited due to the high memory and computational cost. Furthermore, ensembles and BNNs do not necessarily fix all the issues with the underlying member networks. In this work, we study principled approaches to improve the uncertainty property of a single network, based on a single, deterministic representation. By formalizing the uncertainty quantification as a minimax learning problem, we first identify distance awareness, i.e., the model's ability to quantify the distance of a testing example from the training data, as a necessary condition for a DNN to achieve high-quality (i.e., minimax optimal) uncertainty estimation. We then propose Spectral-normalized Neural Gaussian Process (SNGP), a simple method that improves the distance-awareness ability of modern DNNs with two simple changes: (1) applying spectral normalization to hidden weights to enforce bi-Lipschitz smoothness in representations and (2) replacing the last output layer with a Gaussian process layer. On a suite of vision and language understanding benchmarks and on modern architectures (Wide-ResNet and BERT), SNGP consistently outperforms other single-model approaches in prediction, calibration and out-of-domain detection. Furthermore, SNGP provides complementary benefits to popular techniques such as deep ensembles and data augmentation, making it a simple and scalable building block for probabilistic deep learning.
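The first of the two changes, spectral normalization of a weight matrix, can be sketched with power iteration. This is a generic illustration, not the SNGP reference code. The norm bound of 0.95, the iteration count, and the function name `spectral_normalize` are assumptions, and the Gaussian process output layer is not shown.

```python
import numpy as np

def spectral_normalize(W, norm_bound=0.95, n_power_iters=200, seed=0):
    """Rescale W so its estimated largest singular value is at most norm_bound."""
    rng = np.random.default_rng(seed)
    u = rng.normal(size=W.shape[0])
    u /= np.linalg.norm(u)
    # Power iteration alternates between left and right singular vector estimates.
    for _ in range(n_power_iters):
        v = W.T @ u
        v /= np.linalg.norm(v)
        u = W @ v
        u /= np.linalg.norm(u)
    sigma = u @ W @ v  # estimate of the largest singular value
    # Shrink only when the estimate exceeds the bound; otherwise leave W as is.
    scale = min(1.0, norm_bound / sigma)
    return W * scale, sigma

# Hypothetical hidden-layer weights.
rng = np.random.default_rng(1)
W = rng.normal(size=(64, 32))
W_sn, sigma_est = spectral_normalize(W)
true_norm_after = np.linalg.norm(W_sn, 2)
```

Because power iteration approaches the top singular value from below, the rescaled matrix can exceed the bound very slightly. In practice the estimate is usually refined across training steps rather than recomputed from scratch.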
Deep Learning for Bayesian Optimization of Scientific Problems with High-Dimensional Structure
Kim, Samuel, Lu, Peter Y., Loh, Charlotte, Smith, Jamie, Snoek, Jasper, Soljačić, Marin
Bayesian optimization (BO) is a popular paradigm for global optimization of expensive black-box functions, but there are many domains where the function is not completely a black-box. The data may have some known structure (e.g. symmetries) and/or the data generation process may be a composite process that yields useful intermediate or auxiliary information in addition to the value of the optimization objective. However, surrogate models traditionally employed in BO, such as Gaussian Processes (GPs), scale poorly with dataset size and do not easily accommodate known structure. Instead, we use Bayesian neural networks, a class of scalable and flexible surrogate models with inductive biases, to extend BO to complex, structured problems with high dimensionality. We demonstrate BO on a number of realistic problems in physics and chemistry, including topology optimization of photonic crystal materials using convolutional neural networks, and chemical property optimization of molecules using graph neural networks. On these complex tasks, we show that neural networks often outperform GPs as surrogate models for BO in terms of both sampling efficiency and computational cost.
Sparse MoEs meet Efficient Ensembles
Allingham, James Urquhart, Wenzel, Florian, Mariet, Zelda E, Mustafa, Basil, Puigcerver, Joan, Houlsby, Neil, Jerfel, Ghassen, Fortuin, Vincent, Lakshminarayanan, Balaji, Snoek, Jasper, Tran, Dustin, Ruiz, Carlos Riquelme, Jenatton, Rodolphe
Machine learning models based on the aggregated outputs of submodels, either at the activation or prediction levels, lead to strong performance. We study the interplay of two popular classes of such models: ensembles of neural networks and sparse mixture of experts (sparse MoEs). First, we show that these two approaches have complementary features whose combination is beneficial. Then, we present partitioned batch ensembles, an efficient ensemble of sparse MoEs that takes the best of both classes of models. Extensive experiments on fine-tuned vision transformers demonstrate the accuracy, log-likelihood, few-shot learning, robustness, and uncertainty calibration improvements of our approach over several challenging baselines. Partitioned batch ensembles not only scale to models with up to 2.7B parameters, but also provide larger performance gains for larger models.
Automatic prior selection for meta Bayesian optimization with a case study on tuning deep neural network optimizers
Wang, Zi, Dahl, George E., Swersky, Kevin, Lee, Chansoo, Mariet, Zelda, Nado, Zack, Gilmer, Justin, Snoek, Jasper, Ghahramani, Zoubin
The performance of deep neural networks can be highly sensitive to the choice of a variety of meta-parameters, such as optimizer parameters and model hyperparameters. Tuning these well, however, often requires extensive and costly experimentation. Bayesian optimization (BO) is a principled approach to solve such expensive hyperparameter tuning problems efficiently. Key to the performance of BO is specifying and refining a distribution over functions, which is used to reason about the optima of the underlying function being optimized. In this work, we consider the scenario where we have data from similar functions that allows us to specify a tighter distribution a priori. Specifically, we focus on the common but potentially costly task of tuning optimizer parameters for training neural networks. Building on the meta BO method from Wang et al. (2018), we develop practical improvements that (a) boost its performance by leveraging tuning results on multiple tasks without requiring observations for the same meta-parameter points across all tasks, and (b) retain its regret bound for a special case of our method. As a result, we provide a coherent BO solution for iterative optimization of continuous optimizer parameters. To verify our approach in realistic model training setups, we collected a large multi-task hyperparameter tuning dataset by training tens of thousands of configurations of near-state-of-the-art models on popular image and text datasets, as well as a protein sequence dataset. Our results show that on average, our method is able to locate good hyperparameters at least 3 times more efficiently than the best competing methods.
Hyperparameter Ensembles for Robustness and Uncertainty Quantification
Wenzel, Florian, Snoek, Jasper, Tran, Dustin, Jenatton, Rodolphe
Ensembles over neural network weights trained from different random initializations, known as deep ensembles, achieve state-of-the-art accuracy and calibration. The recently introduced batch ensembles provide a drop-in replacement that is more parameter efficient. In this paper, we design ensembles not only over weights, but over hyperparameters to improve the state of the art in both settings. For best performance independent of budget, we propose hyper-deep ensembles, a simple procedure that involves a random search over different hyperparameters, themselves stratified across multiple random initializations. Its strong performance highlights the benefit of combining models with both weight and hyperparameter diversity. We further propose a parameter efficient version, hyper-batch ensembles, which builds on the layer structure of batch ensembles and self-tuning networks. The computational and memory costs of our method are notably lower than typical ensembles. On image classification tasks, with MLP, LeNet, ResNet 20 and Wide ResNet 28-10 architectures, we improve upon both deep and batch ensembles.
A Spectral Energy Distance for Parallel Speech Synthesis
Gritsenko, Alexey A., Salimans, Tim, Berg, Rianne van den, Snoek, Jasper, Kalchbrenner, Nal
Speech synthesis is an important practical generative modeling problem that has seen great progress over the last few years, with likelihood-based autoregressive neural models now outperforming traditional concatenative systems. A downside of such autoregressive models is that they require executing tens of thousands of sequential operations per second of generated audio, making them ill-suited for deployment on specialized deep learning hardware. Here, we propose a new learning method that allows us to train highly parallel models of speech, without requiring access to an analytical likelihood function. Our approach is based on a generalized energy distance between the distributions of the generated and real audio. This spectral energy distance is a proper scoring rule with respect to the distribution over magnitude-spectrograms of the generated waveform audio and offers statistical consistency guarantees. The distance can be calculated from minibatches without bias, and does not involve adversarial learning, yielding a stable and consistent method for training implicit generative models. Empirically, we achieve state-of-the-art generation quality among implicit generative models, as judged by the recently-proposed cFDSD metric. When combining our method with adversarial techniques, we also improve upon the recently-proposed GAN-TTS model in terms of Mean Opinion Score as judged by trained human evaluators.
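For intuition, the plain energy distance between two sample sets can be estimated from minibatches without bias, as sketched below. The paper's spectral energy distance is defined over magnitude spectrograms of audio. This sketch substitutes ordinary Euclidean distance between vectors, and the function names are assumptions for illustration.

```python
import numpy as np

def mean_pairwise_distance(A, B, exclude_diagonal=False):
    """Average Euclidean distance between rows of A and rows of B."""
    dists = np.linalg.norm(A[:, None, :] - B[None, :, :], axis=-1)
    if exclude_diagonal:
        # Only valid when A and B are the same sample: skip the zero self-distances.
        n = A.shape[0]
        return (dists.sum() - np.trace(dists)) / (n * (n - 1))
    return dists.mean()

def energy_distance_sq(X, Y):
    """Unbiased estimate of 2 E||x - y|| - E||x - x'|| - E||y - y'||."""
    cross = mean_pairwise_distance(X, Y)
    within_x = mean_pairwise_distance(X, X, exclude_diagonal=True)
    within_y = mean_pairwise_distance(Y, Y, exclude_diagonal=True)
    return 2.0 * cross - within_x - within_y

# Hypothetical "real" and "generated" samples.
rng = np.random.default_rng(0)
real = rng.normal(size=(500, 2))
generated_good = rng.normal(size=(500, 2))
generated_bad = rng.normal(loc=3.0, size=(500, 2))
d_good = energy_distance_sq(real, generated_good)
d_bad = energy_distance_sq(real, generated_bad)
```

Because the estimator is unbiased, it can come out slightly negative when the two distributions match. Minimizing it requires only samples from the generator, not an analytical likelihood.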