Uncertainty
Mondrian Forests for Large-Scale Regression when Uncertainty Matters
Lakshminarayanan, Balaji, Roy, Daniel M., Teh, Yee Whye
Many real-world regression problems demand a measure of the uncertainty associated with each prediction. Standard decision forests deliver efficient state-of-the-art predictive performance, but high-quality uncertainty estimates are lacking. Gaussian processes (GPs) deliver uncertainty estimates, but scaling GPs to large-scale data sets comes at the cost of approximating the uncertainty estimates. We extend Mondrian forests, first proposed by Lakshminarayanan et al. (2014) for classification problems, to the large-scale non-parametric regression setting. Using a novel hierarchical Gaussian prior that dovetails with the Mondrian forest framework, we obtain principled uncertainty estimates, while still retaining the computational advantages of decision forests. Through a combination of illustrative examples, real-world large-scale datasets, and Bayesian optimization benchmarks, we demonstrate that Mondrian forests outperform approximate GPs on large-scale regression tasks and deliver better-calibrated uncertainty assessments than decision-forest-based methods.
Provable Algorithms for Inference in Topic Models
Arora, Sanjeev, Ge, Rong, Koehler, Frederic, Ma, Tengyu, Moitra, Ankur
Recently, there has been considerable progress on designing algorithms with provable guarantees -- typically using linear algebraic methods -- for parameter learning in latent variable models. But designing provable algorithms for inference has proven to be more challenging. Here we take a first step towards provable inference in topic models. We leverage a property of topic models that enables us to construct simple linear estimators for the unknown topic proportions that have small variance, and consequently can work with short documents. Our estimators also correspond to finding an estimate around which the posterior is well-concentrated. We show lower bounds that for shorter documents it can be information theoretically impossible to find the hidden topics. Finally, we give empirical results that demonstrate that our algorithm works on realistic topic models. It yields good solutions on synthetic data and runs in time comparable to a {\em single} iteration of Gibbs sampling.
Combinatorial Topic Models using Small-Variance Asymptotics
Jiang, Ke, Sra, Suvrit, Kulis, Brian
Topic models have emerged as fundamental tools in unsupervised machine learning. Most modern topic modeling algorithms take a probabilistic view and derive inference algorithms based on Latent Dirichlet Allocation (LDA) or its variants. In contrast, we study topic modeling as a combinatorial optimization problem, and propose a new objective function derived from LDA by passing to the small-variance limit. We minimize the derived objective by using ideas from combinatorial optimization, which results in a new, fast, and high-quality topic modeling algorithm. In particular, we show that our results are competitive with popular LDA-based topic modeling approaches, and also discuss the (dis)similarities between our approach and its probabilistic counterparts.
Posterior Dispersion Indices
Kucukelbir, Alp, Blei, David M.
Probabilistic modeling is cyclical: we specify a model, infer its posterior, and evaluate its performance. Evaluation drives the cycle, as we revise our model based on how it performs. This requires a metric. Traditionally, predictive accuracy prevails. Yet, predictive accuracy does not tell the whole story. We propose to evaluate a model through posterior dispersion. The idea is to analyze how each datapoint fares in relation to posterior uncertainty around the hidden structure. We propose a family of posterior dispersion indices (PDI) that capture this idea. A PDI identifies rich patterns of model mismatch in three real data examples: voting preferences, supermarket shopping, and population genetics.
Consistency Analysis for the Doubly Stochastic Dirichlet Process
Sun, Xing, Yung, Nelson H. C., Lam, Edmund Y., So, Hayden K. -H.
This technical report proves components consistency for the Doubly Stochastic Dirichlet Process [1] with exponential convergence of posterior probability. We also present the fundamental properties for DSDP as well as inference algorithms. This report is also a support document for the paper "Computationally Efficient Hyperspectral Data Learning Based on the Doubly Stochastic Dirichlet Process" [1]. The probability of data partitions is important in mixture modeling [2]. LetM be the unordered partition ofn observations, then the probability mass function [3] ofM follows.
Semiparametric energy-based probabilistic models
Humplik, Jan, Tkaฤik, Gaลกper
Probabilistic models can be defined by an energy function, where the probability of each state is proportional to the exponential of the state's negative energy. This paper considers a generalization of energy-based models in which the probability of a state is proportional to an arbitrary positive, strictly decreasing, and twice differentiable function of the state's energy. The precise shape of the nonlinear map from energies to unnormalized probabilities has to be learned from data together with the parameters of the energy function. As a case study we show that the above generalization of a fully visible Boltzmann machine yields an accurate model of neural activity of retinal ganglion cells. We attribute this success to the model's ability to easily capture distributions whose probabilities span a large dynamic range, a possible consequence of latent variables that globally couple the system. Similar features have recently been observed in many datasets, suggesting that our new method has wide applicability.
Learning Nonparametric Forest Graphical Models with Prior Information
Zhu, Yuancheng, Liu, Zhe, Sun, Siqi
We present a framework for incorporating prior information into nonparametric estimation of graphical models. To avoid distributional assumptions, we restrict the graph to be a forest and build on the work of forest density estimation (FDE). We reformulate the FDE approach from a Bayesian perspective, and introduce prior distributions on the graphs. As two concrete examples, we apply this framework to estimating scale-free graphs and learning multiple graphs with similar structures. The resulting algorithms are equivalent to finding a maximum spanning tree of a weighted graph with a penalty term on the connectivity pattern of the graph. We solve the optimization problem via a minorize-maximization procedure with Kruskal's algorithm. Simulations show that the proposed methods outperform competing parametric methods, and are robust to the true data distribution. They also lead to improvement in predictive power and interpretability in two real data sets.
Bidirectional Helmholtz Machines
Bornschein, Jorg, Shabanian, Samira, Fischer, Asja, Bengio, Yoshua
Efficient unsupervised training and inference in deep generative models remains a challenging problem. One basic approach, called Helmholtz machine, involves training a top-down directed generative model together with a bottom-up auxiliary model used for approximate inference. Recent results indicate that better generative models can be obtained with better approximate inference procedures. Instead of improving the inference procedure, we here propose a new model which guarantees that the top-down and bottom-up distributions can efficiently invert each other. We achieve this by interpreting both the top-down and the bottom-up directed models as approximate inference distributions and by defining the model distribution to be the geometric mean of these two. We present a lower-bound for the likelihood of this model and we show that optimizing this bound regularizes the model so that the Bhattacharyya distance between the bottom-up and top-down approximate distributions is minimized. This approach results in state of the art generative models which prefer significantly deeper architectures while it allows for orders of magnitude more efficient approximate inference.
Bayesian Model Selection of Stochastic Block Models
Abstract--A central problem in analyzing networks is partitioning them into modules or communities. One of the best tools for this is the stochastic block model, which clusters vertices into blocks with statistically homogeneous pattern of links. Despite its flexibility and popularity, there has been a lack of principled statistical model selection criteria for the stochastic block model. Here we propose a Bayesian framework for choosing the number of blocks as well as comparing it to the more elaborate degree-corrected block models, ultimately leading to a universal model selection framework capable of comparing multiple modeling combinations. We will also investigate its connection to the minimum description length principle. I NTRODUCTION An important task in network analysis is community detection, or finding groups of similar vertices which can then be analyzed separately [1]. Community structures offer clues to the processes which generated the graph, on scales ranging from face-to-face social interaction [2] through social-media communications [3] to the organization of food webs [4]. However, previous work often defines a "community" as a group of vertices with high density of connections within the group and a low density of connections to the rest of the network. While this type of assortative community structure is generally the case in social networks, we are interested in a more general definition of functional community--a group of vertices that connect to the rest of the network in similar ways. A set of similar predators form a functional group in a food web, not because they eat each other, but because they feed on similar prey.
Active Uncertainty Calibration in Bayesian ODE Solvers
Kersting, Hans, Hennig, Philipp
There is resurging interest, in statistics and machine learning, in solvers for ordinary differential equations (ODEs) that return probability measures instead of point estimates. Recently, Conrad et al. introduced a sampling-based class of methods that are 'well-calibrated' in a specific sense. But the computational cost of these methods is significantly above that of classic methods. On the other hand, Schober et al. pointed out a precise connection between classic Runge-Kutta ODE solvers and Gaussian filters, which gives only a rough probabilistic calibration, but at negligible cost overhead. By formulating the solution of ODEs as approximate inference in linear Gaussian SDEs, we investigate a range of probabilistic ODE solvers, that bridge the trade-off between computational cost and probabilistic calibration, and identify the inaccurate gradient measurement as the crucial source of uncertainty. We propose the novel filtering-based method Bayesian Quadrature filtering (BQF) which uses Bayesian quadrature to actively learn the imprecision in the gradient measurement by collecting multiple gradient evaluations.