Projected BNNs: Avoiding weight-space pathologies by learning latent representations of neural network weights Machine Learning

Deep learning provides a flexible framework for function approximation and, as a result, deep models have become a standard approach in many domains including machine vision, natural language processing, speech recognition, bioinformatics, and game-playing [LeCun et al., 2015]. However, deep models tend to overfit when the number of training examples is small. Furthermore, in practice, the primary focus in deep learning is often on computing point estimates of model parameters, so these models do not provide uncertainties for their predictions, which makes them unsuitable for applications in critical domains such as personalized medicine. Bayesian neural networks (BNNs) promise to address these issues by modeling the uncertainty in the network weights and, correspondingly, the uncertainty in output predictions [MacKay, 1992b, Neal, 2012]. Unfortunately, characterizing uncertainty over parameters of modern neural networks in a Bayesian setting is challenging due to the high dimensionality of the weight space and complex patterns of dependencies among the weights. In these cases, Markov chain Monte Carlo (MCMC) techniques for performing inference often fail to mix across the weight space, and standard variational approaches not only struggle to escape local optima, but also fail to capture dependencies between the weights. A recent body of work has attempted to improve the quality of inference for BNNs via improved approximate inference methods [Graves, 2011, Blundell et al., 2015, Hernández-Lobato et al., 2016], or by improving the flexibility of the variational approximation for variational inference [Gershman et al., 2012, Ranganath et al., 2016, Louizos and Welling, 2017]. In this work, we introduce a novel approach in which we remove potential redundancies in neural network parameters by learning a nonlinear projection of the weights onto a low-dimensional latent space.
Our approach takes advantage of the following insight: learning (standard network) parameters is easier in the high-dimensional space, but characterizing (Bayesian) uncertainty is easier in the low-dimensional space. Low-dimensional spaces are generally easier to explore, especially if we have fewer correlations between dimensions, and can be better captured by standard variational approximations (e.g.

Randomized Value Functions via Multiplicative Normalizing Flows Machine Learning

Randomized value functions offer a promising approach towards the challenge of efficient exploration in complex environments with high dimensional state and action spaces. Unlike traditional point estimate methods, randomized value functions maintain a posterior distribution over action-space values. This prevents the agent's behavior policy from prematurely exploiting early estimates and falling into local optima. In this work, we leverage recent advances in variational Bayesian neural networks and combine these with traditional Deep Q-Networks (DQN) to achieve randomized value functions for high-dimensional domains. In particular, we augment DQN with multiplicative normalizing flows in order to track an approximate posterior distribution over its parameters. This allows the agent to perform approximate Thompson sampling in a computationally efficient manner via stochastic gradient methods. We demonstrate the benefits of our approach through an empirical comparison in high dimensional environments.
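The Thompson-sampling step mentioned above can be illustrated with a minimal sketch. It assumes an independent Gaussian approximate posterior over each action's value, which is a simplification and not the multiplicative normalizing flow posterior of the abstract; the function name `thompson_select` and all numbers are hypothetical.

```python
import random


def thompson_select(means, stds, rng):
    """Draw one value per action from its (assumed Gaussian) posterior,
    then act greedily with respect to that single draw."""
    samples = [rng.gauss(m, s) for m, s in zip(means, stds)]
    return max(range(len(samples)), key=samples.__getitem__)


rng = random.Random(0)

# Hypothetical posterior summaries for three actions:
# action 0 is uncertain, action 1 looks best, action 2 is confidently poor.
means = [1.0, 1.2, 0.2]
stds = [0.5, 0.05, 0.01]

counts = [0, 0, 0]
for _ in range(2000):
    counts[thompson_select(means, stds, rng)] += 1

# The uncertain action 0 is still tried regularly, while action 2 is not.
print(counts)
```

Because actions are chosen from posterior draws rather than from point estimates, an action whose value is still uncertain keeps being explored in proportion to its chance of being the best one.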

A statistical theory of cold posteriors in deep neural networks Machine Learning

To get Bayesian neural networks to perform comparably to standard neural networks, it is usually necessary to artificially reduce uncertainty using a "tempered" or "cold" posterior. This is extremely concerning: if the prior is accurate, Bayesian inference/decision theory is optimal, and any artificial changes to the posterior should harm performance. While this suggests that the prior may be at fault, here we argue that, in fact, BNNs for image classification use the wrong likelihood. In particular, standard image benchmark datasets such as CIFAR-10 are carefully curated. We develop a generative model describing curation which gives a principled Bayesian account of cold posteriors, because the likelihood under this new generative model closely matches the tempered likelihoods used in past work.
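To make the effect of tempering concrete, the sketch below uses a toy conjugate model rather than the curation model of the abstract: a Gaussian mean with a zero-mean Gaussian prior, where the likelihood is raised to the power 1/T. The function name and the data are hypothetical. With T < 1 the likelihood is sharpened, which behaves like observing each data point 1/T times and so shrinks the posterior variance.

```python
def tempered_gaussian_posterior(xs, prior_var, noise_var, temperature):
    """Posterior over a Gaussian mean (prior N(0, prior_var)) when the
    likelihood N(x | theta, noise_var) is raised to the power 1/temperature."""
    # Tempering the likelihood rescales the effective number of observations.
    n_eff = len(xs) / temperature
    precision = 1.0 / prior_var + n_eff / noise_var
    mean = (sum(xs) / temperature / noise_var) / precision
    return mean, 1.0 / precision


data = [1.0, 2.0, 3.0]
bayes_mean, bayes_var = tempered_gaussian_posterior(data, 1.0, 1.0, temperature=1.0)
cold_mean, cold_var = tempered_gaussian_posterior(data, 1.0, 1.0, temperature=0.5)

# The tempered (T = 0.5) posterior is narrower than the untempered one.
print(bayes_var, cold_var)
```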

Radial and Directional Posteriors for Bayesian Neural Networks Machine Learning

We propose a new variational family for Bayesian neural networks. We decompose the variational posterior into two components: the radial component captures the strength of each neuron in terms of its magnitude, while the directional component captures the statistical dependencies among the weight parameters. The dependencies learned via the directional density provide better modeling performance compared to the widely-used Gaussian mean-field-type variational family. In addition, the strength of input and output neurons learned via the radial density provides a structured way to compress neural networks. Indeed, experiments show that our variational family improves predictive performance and yields compressed networks simultaneously.
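The radial/directional split can be sketched as sampling a weight vector as a positive radius times a unit direction. The specific densities below (a log-normal radius and a normalized Gaussian direction) are assumptions chosen for simplicity, not necessarily the densities used by the authors, and `sample_weight` is a hypothetical helper.

```python
import math
import random


def sample_weight(mu_dir, dir_noise, log_r_mean, log_r_std, rng):
    """Sample w = radius * direction for one neuron's weight vector.

    direction: a Gaussian perturbation of mu_dir projected onto the unit sphere
               (assumed stand-in for a directional density).
    radius:    log-normal, so it is always positive (assumed radial density).
    """
    v = [m + dir_noise * rng.gauss(0.0, 1.0) for m in mu_dir]
    norm = math.sqrt(sum(x * x for x in v))
    direction = [x / norm for x in v]
    radius = math.exp(rng.gauss(log_r_mean, log_r_std))
    return radius, [radius * d for d in direction]


rng = random.Random(0)
radius, weight = sample_weight([1.0, 0.0, 0.0], 0.1, 0.0, 0.2, rng)

# The norm of the sampled weight vector equals the sampled radius.
print(radius, math.sqrt(sum(x * x for x in weight)))
```

Under this parameterization, a neuron whose radius concentrates near zero contributes little to the output, which is one way to read the compression claim in the abstract.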

NCP-VAE: Variational Autoencoders with Noise Contrastive Priors Machine Learning

Variational autoencoders (VAEs) are powerful likelihood-based generative models with applications in various domains. However, they struggle to generate high-quality images, especially when samples are obtained from the prior without any tempering. One explanation for VAEs' poor generative quality is the prior hole problem: the prior distribution fails to match the aggregate approximate posterior. Due to this mismatch, there exist areas in the latent space with high density under the prior that do not correspond to any encoded image. Samples from those areas are decoded to corrupted images. To tackle this issue, we propose an energy-based prior defined by the product of a base prior distribution and a reweighting factor, designed to bring the base closer to the aggregate posterior. We train the reweighting factor by noise contrastive estimation, and we generalize it to hierarchical VAEs with many latent variable groups. Our experiments confirm that the proposed noise contrastive priors improve the generative performance of state-of-the-art VAEs by a large margin on the MNIST, CIFAR-10, CelebA 64, and CelebA HQ 256 datasets.
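A one-dimensional sketch of a reweighted prior follows. It relies on the standard density-ratio fact that an optimal binary classifier D separating two distributions satisfies D/(1 - D) = q/p, so the exponential of the classifier logit can serve as the reweighting factor. Everything concrete here is an assumption for illustration: the "aggregate posterior" is taken to be N(1, 0.5^2), the classifier logit is written analytically instead of being trained, and samples are drawn by sampling-importance-resampling.

```python
import math
import random


def log_normal_pdf(z, mean, std):
    return -0.5 * ((z - mean) / std) ** 2 - math.log(std) - 0.5 * math.log(2 * math.pi)


def classifier_logit(z):
    # Stand-in for a trained classifier that separates aggregate-posterior
    # samples (assumed N(1, 0.5^2)) from base-prior samples (N(0, 1)).
    return log_normal_pdf(z, 1.0, 0.5) - log_normal_pdf(z, 0.0, 1.0)


def sample_reweighted_prior(n_proposals, n_samples, rng):
    """Sampling-importance-resampling from p(z) proportional to p_base(z) * r(z)."""
    proposals = [rng.gauss(0.0, 1.0) for _ in range(n_proposals)]  # base prior draws
    weights = [math.exp(classifier_logit(z)) for z in proposals]   # r(z) = D / (1 - D)
    return rng.choices(proposals, weights=weights, k=n_samples)


rng = random.Random(0)
samples = sample_reweighted_prior(5000, 2000, rng)
sample_mean = sum(samples) / len(samples)

# Resampled latents concentrate near the assumed aggregate posterior,
# away from base-prior regions that no encoded input occupies.
print(sample_mean)
```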