
Collaborating Authors: Saremi, Saeed


Unnormalized Variational Bayes

arXiv.org Machine Learning

We unify empirical Bayes and variational Bayes for approximating unnormalized densities. This framework, named unnormalized variational Bayes (UVB), is based on formulating a latent variable model for the random variable $Y=X+N(0,\sigma^2 I_d)$ and using the evidence lower bound (ELBO), computed by a variational autoencoder, as a parametrization of the energy function of $Y$, which is then used to estimate $X$ with the empirical Bayes least-squares estimator. In this intriguing setup, the $\textit{gradient}$ of the ELBO with respect to the noisy inputs plays the central role in learning the energy function. Empirically, we demonstrate that UVB has a higher capacity to approximate energy functions than the MLP parametrization used in neural empirical Bayes (DEEN). We especially showcase $\sigma=1$, where the differences between UVB and DEEN become qualitatively visible in the denoising experiments. For this high level of noise, the distribution of $Y$ is heavily smoothed, and we demonstrate that one can traverse all MNIST classes in a variety of styles in a single run (without a restart) via walk-jump sampling with a fast-mixing Langevin MCMC sampler. We finish by probing the encoder/decoder of the trained models and confirm that UVB $\neq$ VAE.
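
To make the estimator concrete, here is a minimal sketch (in PyTorch) of the empirical Bayes jump described above, assuming a trained variational autoencoder exposes a hypothetical elbo(y) method returning a per-example evidence lower bound at the noisy input; the negative ELBO plays the role of the energy of $Y$, and the gradient of the ELBO with respect to $y$ gives the least-squares estimate of $X$:

import torch

def uvb_denoise(vae, y, sigma):
    # Empirical Bayes least-squares estimate: x_hat(y) = y + sigma^2 * grad_y ELBO(y).
    # `vae.elbo` is a hypothetical API, not the paper's code.
    y = y.detach().requires_grad_(True)
    elbo = vae.elbo(y).sum()
    (grad_y,) = torch.autograd.grad(elbo, y)
    return (y + sigma ** 2 * grad_y).detach()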


Learning and Inference in Imaginary Noise Models

arXiv.org Machine Learning

Inspired by recent developments in learning smoothed densities with empirical Bayes, we study variational autoencoders with a decoder that is tailored for the random variable $Y=X+N(0,\sigma^2 I_d)$. A notion of smoothed variational inference emerges where the smoothing is implicitly enforced by the noise model of the decoder; "implicit", since during training the encoder only sees clean samples. This is the concept of the imaginary noise model, where the noise model dictates the functional form of the variational lower bound $\mathcal{L}(\sigma)$, but the noisy data are never seen during learning. The model is named $\sigma$-VAE. We prove that all $\sigma$-VAEs are equivalent to each other via a simple $\beta$-VAE expansion: $\mathcal{L}(\sigma_2) \equiv \mathcal{L}(\sigma_1,\beta)$, where $\beta=\sigma_2^2/\sigma_1^2$. We prove a similar result for the Laplace distribution in exponential families. Empirically, we report an intriguing power law $\mathcal{D}_{\rm KL} \sim \sigma^{-\nu}$ for the learned models, and we study inference in the $\sigma$-VAE for unseen noisy data. The experiments were performed on MNIST, where we show that, quite remarkably, the model can make reasonable inferences on extremely noisy samples even though it has not seen any during training. The vanilla VAE breaks down completely in this regime. We finish with a hypothesis (the XYZ hypothesis) about these findings.
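
The stated $\beta$-VAE equivalence can be seen in a few lines (a sketch, assuming a Gaussian decoder $N(\mu_\theta(z),\sigma^2 I_d)$ and writing the bound up to additive constants):

$$\mathcal{L}(\sigma) = -\frac{1}{2\sigma^2}\,\mathbb{E}_{q(z|x)}\big\|x-\mu_\theta(z)\big\|^2 - \mathcal{D}_{\rm KL}\big(q(z|x)\,\|\,p(z)\big) + {\rm const},$$

so rescaling $\mathcal{L}(\sigma_2)$ by the positive constant $\sigma_2^2/\sigma_1^2$ (which leaves the maximizers unchanged) gives

$$\frac{\sigma_2^2}{\sigma_1^2}\,\mathcal{L}(\sigma_2) = -\frac{1}{2\sigma_1^2}\,\mathbb{E}_{q(z|x)}\big\|x-\mu_\theta(z)\big\|^2 - \beta\,\mathcal{D}_{\rm KL}\big(q(z|x)\,\|\,p(z)\big) + {\rm const} = \mathcal{L}(\sigma_1,\beta), \qquad \beta=\frac{\sigma_2^2}{\sigma_1^2}.$$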


Provable Robust Classification via Learned Smoothed Densities

arXiv.org Machine Learning

Smoothing classifiers and probability density functions with Gaussian kernels appear unrelated, but in this work they are unified for the problem of robust classification. The key building block is approximating the $\textit{energy function}$ of the random variable $Y=X+N(0,\sigma^2 I_d)$ with a neural network, which we use to formulate the problem of robust classification in terms of $\widehat{x}(Y)$, the $\textit{Bayes estimator}$ of $X$ given the noisy measurements $Y$. We introduce $\textit{empirical Bayes smoothed classifiers}$ within the framework of $\textit{randomized smoothing}$ and study them theoretically for the two-class linear classifier, where we show one can improve their robustness above $\textit{the margin}$. We test the theory on MNIST and show that with a learned smoothed energy function and a linear classifier we can achieve provable $\ell_2$ robust accuracies that are competitive with empirical defenses. This setup can be significantly improved by $\textit{learning}$ empirical Bayes smoothed classifiers with adversarial training, and on MNIST we show that we can achieve provable robust accuracies higher than the state-of-the-art empirical defenses in a range of radii. We discuss some fundamental challenges of randomized smoothing based on a geometric interpretation due to the concentration of Gaussians in high dimensions, and we finish the paper with a proposal for using walk-jump sampling, itself based on learned smoothed densities, for robust classification.
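
As an illustration of the setup (not the paper's exact procedure), the sketch below composes a learned denoiser with a base classifier in the randomized-smoothing mold; denoise and base_clf are placeholder names for the learned estimator $\widehat{x}(Y)$ and the classifier, and the Monte Carlo vote count is illustrative:

import torch

def smoothed_predict(base_clf, denoise, x, sigma, num_classes, n_samples=100):
    # Majority vote of the base classifier applied to x_hat(x + N(0, sigma^2 I)).
    votes = torch.zeros(num_classes)
    for _ in range(n_samples):
        y = x + sigma * torch.randn_like(x)   # Gaussian-smoothed measurement Y
        x_hat = denoise(y)                    # empirical Bayes estimate of X given Y
        votes[base_clf(x_hat).argmax()] += 1
    return int(votes.argmax())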


On approximating $\nabla f$ with neural networks

arXiv.org Machine Learning

Consider a feedforward neural network $\psi: \mathbb{R}^d\rightarrow \mathbb{R}^d$ such that $\psi\approx \nabla f$, where $f:\mathbb{R}^d \rightarrow \mathbb{R}$ is a smooth function; therefore $\psi$ must satisfy $\partial_j \psi_i = \partial_i \psi_j$ pointwise. We prove a theorem stating that for any such $\psi$ network, and for any depth $L>2$, all the input weights must be parallel to each other. In other words, $\psi$ can only represent $\textit{one}$ feature in its first hidden layer. The proof of the theorem is straightforward, where two backward paths (from $i$ to $j$ and from $j$ to $i$) and a weight-tying matrix (connecting the last and first hidden layers) play the key roles. We thus make a strong theoretical case in favor of the $\textit{implicit}$ parametrization, where the neural network is $\phi: \mathbb{R}^d \rightarrow \mathbb{R}$ and $\nabla \phi \approx \nabla f$. Throughout, we revisit two recent unnormalized probabilistic models that are formulated as $\psi \approx \nabla f$, and we close with a discussion of denoising autoencoders.
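
A minimal sketch of the implicit parametrization argued for above: model a scalar potential $\phi$ and obtain the vector field as $\nabla\phi$ by automatic differentiation, so the symmetry condition $\partial_j \psi_i = \partial_i \psi_j$ holds by construction (the small MLP and input dimension here are only illustrative, not the architecture studied in the paper):

import torch
import torch.nn as nn

d = 2  # illustrative input dimension
phi = nn.Sequential(nn.Linear(d, 64), nn.Softplus(),
                    nn.Linear(64, 64), nn.Softplus(),
                    nn.Linear(64, 1))

def grad_phi(x):
    # psi(x) := grad phi(x), a conservative (gradient) field by construction.
    x = x.detach().requires_grad_(True)
    (g,) = torch.autograd.grad(phi(x).sum(), x, create_graph=True)
    return g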


Neural Empirical Bayes

arXiv.org Machine Learning

We formulate a novel framework that unifies kernel density estimation and empirical Bayes, where we address a broad set of problems in unsupervised learning with a geometric interpretation rooted in the concentration of measure phenomenon. We start with energy estimation based on a denoising objective that recovers the original/clean data $X$ from its measured/noisy version $Y$ with the empirical Bayes least-squares estimator. The setup is rooted in kernel density estimation, but the log-pdf of $Y$ is parametrized with a neural network, and crucially, the learning objective is derived for any level of noise/kernel bandwidth. Learning is efficient with double backpropagation and stochastic gradient descent. An elegant physical picture emerges of an interacting system of high-dimensional spheres around each data point, together with a globally defined probability flow field. The picture is powerful: it leads to a novel sampling algorithm, a new notion of associative memory, and it is instrumental in designing experiments. Our experiments start with extreme denoising. Walk-jump sampling is defined by Langevin MCMC walks in $Y$, along with asynchronous empirical Bayes jumps to $X$. Robbins associative memory is defined by a deterministic flow to attractors of the learned probability flow field. Finally, we observe the emergence of remarkably rich creative modes in the regime of highly overlapping spheres.
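
A minimal sketch of walk-jump sampling, assuming energy(y) is the learned energy of $Y$ (so the score is $-\nabla\,$energy$(y)$) and that an unadjusted Langevin walk is used; the step size and chain length below are illustrative, not the paper's settings:

import torch

def grad_energy(energy, y):
    y = y.detach().requires_grad_(True)
    (g,) = torch.autograd.grad(energy(y).sum(), y)
    return g

def walk_jump_sample(energy, y0, sigma, step=1e-2, n_steps=1000):
    y = y0
    for _ in range(n_steps):
        # Walk: Langevin MCMC in Y-space driven by the learned score -grad energy(y).
        y = y - step * grad_energy(energy, y) + (2 * step) ** 0.5 * torch.randn_like(y)
    # Jump: empirical Bayes least-squares estimate x_hat(y) = y - sigma^2 * grad energy(y).
    return y - sigma ** 2 * grad_energy(energy, y)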


Annealed Generative Adversarial Networks

arXiv.org Machine Learning

We introduce a novel framework for adversarial training where the target distribution is annealed between the uniform distribution and the data distribution. We posit a conjecture that learning under continuous annealing in the nonparametric regime is stable irrespective of the divergence measures in the objective function, and as a corollary we propose an algorithm, dubbed ß-GAN. In this framework, the fact that the initial support of the generative network is the whole ambient space, combined with annealing, is key to balancing the minimax game. In our experiments on synthetic data, MNIST, and CelebA, ß-GAN with a fixed annealing schedule was stable and did not suffer from mode collapse.
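
The sketch below shows one simple way to realize an annealed target that interpolates between the uniform distribution and the data distribution, by replacing each pixel with uniform noise at a rate set by the annealing parameter; this is offered purely as an illustration under that assumption, not as the paper's exact annealing mechanism:

import torch

def annealed_target(x, beta):
    # x: data batch with values in [0, 1]; beta in [0, 1] is the annealing parameter.
    # beta = 0 gives pure uniform noise over the ambient space, beta = 1 recovers the data.
    u = torch.rand_like(x)
    keep = (torch.rand_like(x) < beta).float()
    return keep * x + (1.0 - keep) * u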


The Wilson Machine for Image Modeling

arXiv.org Machine Learning

Learning the distribution of natural images is one of the hardest and most important problems in machine learning. The problem remains open, because the enormous complexity of the structures in natural images spans all length scales. We break down the complexity of the problem and show that the hierarchy of structures in natural images fuels a new class of learning algorithms based on the theory of critical phenomena and stochastic processes. We approach the problem from the perspective of the theory of critical phenomena, which was developed in condensed matter physics to address problems with fluctuations at infinite length scales, and build a framework to integrate the criticality of natural images into a learning algorithm. The problem is broken down by mapping images into a hierarchy of binary images, called bitplanes. In this representation, the top bitplane is critical, having fluctuations in structures over a vast range of scales. The bitplanes below go through a gradual stochastic heating process to disorder. We turn this representation into a directed probabilistic graphical model, transforming the learning problem into the unsupervised learning of the distribution of the critical bitplane and the supervised learning of the conditional distributions for the remaining bitplanes. We learn the conditional distributions by logistic regression in a convolutional architecture. Conditioned on the critical binary image, this simple architecture can generate large, natural-looking images, with many shades of gray, without the use of hidden units, which is unprecedented in studies of natural images. The framework presented here is a major step in bringing criticality and stochastic processes to machine learning and in studying natural image statistics.
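
A minimal sketch of the bitplane decomposition described above, assuming 8-bit grayscale images and a plain binary expansion (whether the paper uses this or a Gray-code-style variant is not specified here): each image is mapped to a stack of binary images, most significant bit first.

import numpy as np

def bitplanes(img_uint8):
    # Decompose an 8-bit image of shape (H, W) into 8 binary images, MSB first.
    img = np.asarray(img_uint8, dtype=np.uint8)
    return np.stack([(img >> b) & 1 for b in range(7, -1, -1)], axis=0)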