Learning Graphical Models
A Scale Mixture Perspective of Multiplicative Noise in Neural Networks
Nalisnick, Eric, Anandkumar, Anima, Smyth, Padhraic
Corrupting the input and hidden layers of deep neural networks (DNNs) with multiplicative noise, often drawn from the Bernoulli distribution (or 'dropout'), provides regularization that has significantly contributed to deep learning's success. However, understanding how multiplicative corruptions prevent overfitting has been difficult due to the complexity of a DNN's functional form. In this paper, we show that when a Gaussian prior is placed on a DNN's weights, applying multiplicative noise induces a Gaussian scale mixture, which can be reparameterized to circumvent the problematic likelihood function. Analysis can then proceed by using a type-II maximum likelihood procedure to derive a closed-form expression revealing how regularization evolves as a function of the network's weights. Results show that multiplicative noise forces weights to become either sparse or invariant to rescaling. We find our analysis has implications for model compression as it naturally reveals a weight pruning rule that starkly contrasts with the commonly used signal-to-noise ratio (SNR). While the SNR prunes weights with large variances, seeing them as noisy, our approach recognizes their robustness and retains them. We empirically demonstrate our approach has a strong advantage over the SNR heuristic and is competitive to retraining with soft targets produced from a teacher model.
A Linear-Time Particle Gibbs Sampler for Infinite Hidden Markov Models
Tripuraneni, Nilesh, Gu, Shane, Ge, Hong, Ghahramani, Zoubin
Infinite Hidden Markov Models (iHMM's) are an attractive, nonparametric generalization of the classical Hidden Markov Model which can automatically infer the number of hidden states in the system. However, due to the infinite-dimensional nature of transition dynamics performing inference in the iHMM is difficult. In this paper, we present an infinite-state Particle Gibbs (PG) algorithm to resample state trajectories for the iHMM. The proposed algorithm uses an efficient proposal optimized for iHMMs and leverages ancestor sampling to suppress degeneracy of the standard PG algorithm. Our algorithm demonstrates significant convergence improvements on synthetic and real world data sets. Additionally, the infinite-state PG algorithm has linear-time complexity in the number of states in the sampler, while competing methods scale quadratically.
Variational consensus Monte Carlo
Rabinovich, Maxim, Angelino, Elaine, Jordan, Michael I.
Practitioners of Bayesian statistics have long depended on Markov chain Monte Carlo (MCMC) to obtain samples from intractable posterior distributions. Unfortunately, MCMC algorithms are typically serial, and do not scale to the large datasets typical of modern machine learning. The recently proposed consensus Monte Carlo algorithm removes this limitation by partitioning the data and drawing samples conditional on each partition in parallel (Scott et al, 2013). A fixed aggregation function then combines these samples, yielding approximate posterior samples. We introduce variational consensus Monte Carlo (VCMC), a variational Bayes algorithm that optimizes over aggregation functions to obtain samples from a distribution that better approximates the target. The resulting objective contains an intractable entropy term; we therefore derive a relaxation of the objective and show that the relaxed problem is blockwise concave under mild conditions. We illustrate the advantages of our algorithm on three inference tasks from the literature, demonstrating both the superior quality of the posterior approximation and the moderate overhead of the optimization step. Our algorithm achieves a relative error reduction (measured against serial MCMC) of up to 39% compared to consensus Monte Carlo on the task of estimating 300-dimensional probit regression parameter expectations; similarly, it achieves an error reduction of 92% on the task of estimating cluster comembership probabilities in a Gaussian mixture model with 8 components in 8 dimensions. Furthermore, these gains come at moderate cost compared to the runtime of serial MCMC, achieving near-ideal speedup in some instances.
The Wreath Process: A totally generative model of geometric shape based on nested symmetries
Borsa, Diana, Graepel, Thore, Gordon, Andrew
We consider the problem of modelling noisy but highly symmetric shapes that can be viewed as hierarchies of whole-part relationships in which higher level objects are composed of transformed collections of lower level objects. To this end, we propose the stochastic wreath process, a fully generative probabilistic model of drawings. Following Leyton's "Generative Theory of Shape", we represent shapes as sequences of transformation groups composed through a wreath product. This representation emphasizes the maximization of transfer --- the idea that the most compact and meaningful representation of a given shape is achieved by maximizing the re-use of existing building blocks or parts. The proposed stochastic wreath process extends Leyton's theory by defining a probability distribution over geometric shapes in terms of noise processes that are aligned with the generative group structure of the shape. We propose an inference scheme for recovering the generative history of given images in terms of the wreath process using reversible jump Markov chain Monte Carlo methods and Approximate Bayesian Computation. In the context of sketching we demonstrate the feasibility and limitations of this approach on model-generated and real data.
Learning Mixtures of Ising Models using Pseudolikelihood
Maximum pseudolikelihood method has been among the most important methods for learning parameters of statistical physics models, such as Ising models. In this paper, we study how pseudolikelihood can be derived for learning parameters of a mixture of Ising models. The performance of the proposed approach is demonstrated for Ising and Potts models on both synthetic and real data.
Convergence Rates of Active Learning for Maximum Likelihood Estimation
Chaudhuri, Kamalika, Kakade, Sham, Netrapalli, Praneeth, Sanghavi, Sujay
In active learning, we are given a sample space X, a label space Y, a class of models that map X to Y, and a large set U of unlabelled samples. The goal of the learner is to learn a model in the class with small target error while interactively querying the labels of as few of the unlabelled samples as possible. Most theoretical work on active learning has focussed on the PAC or the agnostic PAC model, where the goal is to learn binary classifiers that belong to a particular hypothesis class [2, 13, 9, 6, 3, 4,22], andtherehasbeenonlyahandful ofexceptions[19, 8,20]. Inthispaper, weshift ourattention to a more general setting - maximum likelihood estimation (MLE), where Pr(Y X) is described by a model θ belonging to a model class Θ. We show that when data is generated by a model in this class, we can do active learning provided the model class Θ has the following simple property: the Fisher information matrix for any model θ Θ at any (x,y) depends only on x and θ. This condition is satisfied in a number of widely applicable model classes, such as Linear Regression and Generalized Linear Models (GLMs), which in turn includes models for Multiclass Classification and Conditional 1 Random Fields. Consequently, we can provide active learning algorithms for maximum likelihood estimation in all these model classes. The standard solution to active MLE estimation in the statistics literature is to select samples for label query by optimizing a class of summary statistics of the asymptotic covariance matrix of the estimator [5]. The literature, however, does not provide any guidance towards which summary statistic should be used, or any analysis of the solution quality when a finite number of labels or samples are available.
Population Empirical Bayes
Kucukelbir, Alp, Blei, David M.
Bayesian predictive inference analyzes a dataset to make predictions about new observations. When a model does not match the data, predictive accuracy suffers. We develop population empirical Bayes (POP-EB), a hierarchical framework that explicitly models the empirical population distribution as part of Bayesian analysis. We introduce a new concept, the latent dataset, as a hierarchical variable and set the empirical population as its prior. This leads to a new predictive density that mitigates model mismatch. We efficiently apply this method to complex models by proposing a stochastic variational inference algorithm, called bumping variational inference (BUMP-VI). We demonstrate improved predictive accuracy over classical Bayesian inference in three models: a linear regression model of health data, a Bayesian mixture model of natural images, and a latent Dirichlet allocation topic model of scientific documents.
Fast Mixing for Discrete Point Processes
Rebeschini, Patrick, Karbasi, Amin
We investigate the systematic mechanism for designing fast mixing Markov chain Monte Carlo algorithms to sample from discrete point processes under the Dobrushin uniqueness condition for Gibbs measures. Discrete point processes are defined as probability distributions $\mu(S)\propto \exp(\beta f(S))$ over all subsets $S\in 2^V$ of a finite set $V$ through a bounded set function $f:2^V\rightarrow \mathbb{R}$ and a parameter $\beta>0$. A subclass of discrete point processes characterized by submodular functions (which include log-submodular distributions, submodular point processes, and determinantal point processes) has recently gained a lot of interest in machine learning and shown to be effective for modeling diversity and coverage. We show that if the set function (not necessarily submodular) displays a natural notion of decay of correlation, then, for $\beta$ small enough, it is possible to design fast mixing Markov chain Monte Carlo methods that yield error bounds on marginal approximations that do not depend on the size of the set $V$. The sufficient conditions that we derive involve a control on the (discrete) Hessian of set functions, a quantity that has not been previously considered in the literature. We specialize our results for submodular functions, and we discuss canonical examples where the Hessian can be easily controlled.
JUMP-Means: Small-Variance Asymptotics for Markov Jump Processes
Huggins, Jonathan H., Narasimhan, Karthik, Saeedi, Ardavan, Mansinghka, Vikash K.
Markov jump processes (MJPs) are used to model a wide range of phenomena from disease progression to RNA path folding. However, maximum likelihood estimation of parametric models leads to degenerate trajectories and inferential performance is poor in nonparametric models. We take a small-variance asymptotics (SVA) approach to overcome these limitations. We derive the small-variance asymptotics for parametric and nonparametric MJPs for both directly observed and hidden state models. In the parametric case we obtain a novel objective function which leads to non-degenerate trajectories. To derive the nonparametric version we introduce the gamma-gamma process, a novel extension to the gamma-exponential process. We propose algorithms for each of these formulations, which we call \emph{JUMP-means}. Our experiments demonstrate that JUMP-means is competitive with or outperforms widely used MJP inference approaches in terms of both speed and reconstruction accuracy.
MADE: Masked Autoencoder for Distribution Estimation
Germain, Mathieu, Gregor, Karol, Murray, Iain, Larochelle, Hugo
There has been a lot of recent interest in designing neural network models to estimate a distribution from a set of examples. We introduce a simple modification for autoencoder neural networks that yields powerful generative models. Our method masks the autoencoder's parameters to respect autoregressive constraints: each input is reconstructed only from previous inputs in a given ordering. Constrained this way, the autoencoder outputs can be interpreted as a set of conditional probabilities, and their product, the full joint probability. We can also train a single network that can decompose the joint probability in multiple different orderings. Our simple framework can be applied to multiple architectures, including deep ones. Vectorized implementations, such as on GPUs, are simple and fast. Experiments demonstrate that this approach is competitive with state-of-the-art tractable distribution estimators. At test time, the method is significantly faster and scales better than other autoregressive estimators.