Learning Graphical Models
Sharp Computational-Statistical Phase Transitions via Oracle Computational Model
Wang, Zhaoran, Gu, Quanquan, Liu, Han
We study the fundamental tradeoffs between computational tractability and statistical accuracy for a general family of hypothesis testing problems with combinatorial structures. Based upon an oracle model of computation, which captures the interactions between algorithms and data, we establish a general lower bound that explicitly connects the minimum testing risk under computational budget constraints with the intrinsic probabilistic and combinatorial structures of statistical problems. This lower bound mirrors the classical statistical lower bound by Le Cam (1986) and allows us to quantify the optimal statistical performance achievable given limited computational budgets in a systematic fashion. Under this unified framework, we sharply characterize the statistical-computational phase transition for two testing problems, namely, normal mean detection and sparse principal component detection. For normal mean detection, we consider two combinatorial structures, namely, sparse set and perfect matching. For these problems we identify significant gaps between the optimal statistical accuracy that is achievable under computational tractability constraints and the classical statistical lower bounds. Compared with existing works on computational lower bounds for statistical problems, which consider general polynomial-time algorithms on Turing machines, and rely on computational hardness hypotheses on problems like planted clique detection, we focus on the oracle computational model, which covers a broad range of popular algorithms, and do not rely on unproven hypotheses. Moreover, our result provides an intuitive and concrete interpretation for the intrinsic computational intractability of high-dimensional statistical problems. One byproduct of our result is a lower bound for a strict generalization of the matrix permanent problem, which is of independent interest.
The Poisson Gamma Belief Network
Zhou, Mingyuan, Cong, Yulai, Chen, Bo
To infer a multilayer representation of high-dimensional count vectors, we propose the Poisson gamma belief network (PGBN) that factorizes each of its layers into the product of a connection weight matrix and the nonnegative real hidden units of the next layer. The PGBN's hidden layers are jointly trained with an upward-downward Gibbs sampler, each iteration of which upward samples Dirichlet distributed connection weight vectors starting from the first layer (bottom data layer), and then downward samples gamma distributed hidden units starting from the top hidden layer. The gamma-negative binomial process combined with a layer-wise training strategy allows the PGBN to infer the width of each layer given a fixed budget on the width of the first layer. The PGBN with a single hidden layer reduces to Poisson factor analysis. Example results on text analysis illustrate interesting relationships between the width of the first layer and the inferred network structure, and demonstrate that the PGBN, whose hidden units are imposed with correlated gamma priors, can add more layers to increase its performance gains over Poisson factor analysis, given the same limit on the width of the first layer.
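The generative side of the PGBN described above can be illustrated with a minimal sketch. It assumes unit gamma scale parameters, fixed layer widths, and hypothetical hyperparameters (`eta`, `top_shape`). The helper names are illustrative rather than taken from the paper, and the upward-downward Gibbs sampler and the width inference are not shown.

```python
import math
import random


def sample_dirichlet(rng, dim, eta):
    """Draw one Dirichlet(eta, ..., eta) vector by normalizing gamma draws."""
    draws = [rng.gammavariate(eta, 1.0) for _ in range(dim)]
    total = sum(draws)
    return [d / total for d in draws]


def sample_poisson(rng, rate):
    """Knuth's multiplication method; adequate for the small rates used here."""
    limit = math.exp(-rate)
    count, prod = 0, rng.random()
    while prod > limit:
        count += 1
        prod *= rng.random()
    return count


def matvec(mat, vec):
    return [sum(row[k] * vec[k] for k in range(len(vec))) for row in mat]


def sample_pgbn_counts(widths, vocab_size, eta=0.5, top_shape=2.0, seed=0):
    """Sample one count vector; widths = [K_1, ..., K_T] (assumed fixed)."""
    rng = random.Random(seed)
    dims = [vocab_size] + list(widths)
    # phis[t] is a dims[t] x dims[t + 1] matrix whose columns are Dirichlet draws.
    phis = []
    for t in range(len(widths)):
        cols = [sample_dirichlet(rng, dims[t], eta) for _ in range(dims[t + 1])]
        phis.append([[cols[k][v] for k in range(dims[t + 1])]
                     for v in range(dims[t])])
    # Top layer: gamma hidden units with a shared shape (an assumption).
    theta = [rng.gammavariate(top_shape, 1.0) for _ in range(widths[-1])]
    # Move downward: each layer's gamma shape is Phi times the layer above.
    for t in range(len(widths) - 1, 0, -1):
        shapes = matvec(phis[t], theta)
        # The tiny floor only guards against floating-point underflow.
        theta = [rng.gammavariate(max(s, 1e-12), 1.0) for s in shapes]
    # Bottom data layer: Poisson counts with rates Phi^(1) theta^(1).
    rates = matvec(phis[0], theta)
    return [sample_poisson(rng, r) for r in rates], phis


if __name__ == "__main__":
    counts, _ = sample_pgbn_counts(widths=[5, 3], vocab_size=8, seed=1)
    print(counts)
```

With a single entry in `widths`, the downward loop is skipped and the sketch samples from a one-hidden-layer Poisson factorization, in line with the reduction to Poisson factor analysis mentioned above.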
Infinite Edge Partition Models for Overlapping Community Detection and Link Prediction
A hierarchical gamma process infinite edge partition model is proposed to factorize the binary adjacency matrix of an unweighted undirected relational network under a Bernoulli-Poisson link. The model describes both homophily and stochastic equivalence, and is scalable to big sparse networks by focusing its computation on pairs of linked nodes. It can not only discover overlapping communities and inter-community interactions, but also predict missing edges. A simplified version omitting inter-community interactions is also provided and we reveal its interesting connections to existing models. The number of communities is automatically inferred in a nonparametric Bayesian manner, and efficient inference via Gibbs sampling is derived using novel data augmentation techniques. Experimental results on four real networks demonstrate the models' scalability and state-of-the-art performance.
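The Bernoulli-Poisson link can be sketched in a few lines: an edge is present when a latent Poisson count is at least one, so the edge probability is one minus the probability of a zero count. The rate form used below (node memberships combined through a community interaction matrix) and all names (`phi_i`, `phi_j`, `lam`) are assumptions made for illustration; a diagonal `lam` corresponds to the simplified version without inter-community interactions.

```python
import math


def link_probability(phi_i, phi_j, lam):
    """P(edge) = 1 - exp(-rate), rate = sum_{k,k'} phi_i[k] * lam[k][k'] * phi_j[k']."""
    rate = sum(
        phi_i[k] * lam[k][kk] * phi_j[kk]
        for k in range(len(phi_i))
        for kk in range(len(phi_j))
    )
    return 1.0 - math.exp(-rate)


if __name__ == "__main__":
    # Two communities with no interaction between them (diagonal lam).
    lam = [[1.5, 0.0], [0.0, 0.8]]
    print(link_probability([1.0, 0.2], [0.9, 0.1], lam))
```

Because the probability depends on the memberships only through a nonnegative rate, two nodes that share no interacting communities have edge probability exactly zero under this sketch.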
Conditional probability generation methods for high reliability effects-based decision making
Garn, Wolfgang, Louvieris, Panos
Decision making is often based on Bayesian networks. The building blocks of a Bayesian network are its conditional probability tables (CPTs). These tables are obtained by parameter estimation methods, or they are elicited from subject matter experts (SMEs). Some of these knowledge representations are insufficient approximations. Fusing knowledge from cause and effect observations leads to better predictive decisions. We propose three new methods to generate CPTs, which work even when only soft evidence is provided. The first two are novel ways of mapping conditional expectations to the probability space. The third is a column extraction method, which obtains CPTs from nonlinear functions such as the multinomial logistic regression. Case studies on military effects and burnt forest desertification have demonstrated that CPTs derived in this way have highly reliable predictive power, including superiority over the CPTs obtained from SMEs. In this context, new quality measures for determining the goodness of a CPT and for comparing CPTs with each other have been introduced. The predictive power and enhanced reliability of decision making based on the novel CPT generation methods presented in this paper have been confirmed and validated within the context of the case studies.
Mining Massive Hierarchical Data Using a Scalable Probabilistic Graphical Model
AlJadda, Khalifeh, Korayem, Mohammed, Ortiz, Camilo, Grainger, Trey, Miller, John A., Rasheed, Khaled, Kochut, Krys J., York, William S., Ranzinger, Rene, Porterfield, Melody
Probabilistic Graphical Models (PGMs) are very useful in the fields of machine learning and data mining. The crucial limitation of those models, however, is scalability. The Bayesian Network, which is one of the most common PGMs used in machine learning and data mining, demonstrates this limitation when the training data consists of random variables, each of which has a large set of possible values. In the big data era, one would expect new extensions to the existing PGMs to handle the massive amount of data produced these days by computers, sensors and other electronic devices. With hierarchical data - data that is arranged in a treelike structure with several levels - one would expect to see hundreds of thousands or millions of values distributed over even just a small number of levels. When modeling this kind of hierarchical data across large data sets, Bayesian Networks become infeasible for representing the probability distributions. In this paper we introduce an extension to Bayesian Networks to handle massive sets of hierarchical data in a reasonable amount of time and space. The proposed model achieves perfect precision of 1.0 and high recall of 0.93 when it is used as a multi-label classifier for the annotation of mass spectrometry data. On another data set of 1.5 billion search logs provided by CareerBuilder.com, the model was able to predict latent semantic relationships between search keywords with accuracy up to 0.80.
Statistical and Computational Guarantees for the Baum-Welch Algorithm
Yang, Fanny, Balakrishnan, Sivaraman, Wainwright, Martin J.
The Hidden Markov Model (HMM) is one of the mainstays of statistical modeling of discrete time series, with applications including speech recognition, computational biology, computer vision and econometrics. Estimating an HMM from its observation process is often addressed via the Baum-Welch algorithm, which is known to be susceptible to local optima. In this paper, we first give a general characterization of the basin of attraction associated with any global optimum of the population likelihood. By exploiting this characterization, we provide non-asymptotic finite sample guarantees on the Baum-Welch updates, guaranteeing geometric convergence to a small ball of radius on the order of the minimax rate around a global optimum. As a concrete example, we prove a linear rate of convergence for a hidden Markov mixture of two isotropic Gaussians given a suitable mean separation and an initialization within a ball of large radius around (one of) the true parameters. To our knowledge, these are the first rigorous local convergence guarantees to global optima for the Baum-Welch algorithm in a setting where the likelihood function is nonconvex. We complement our theoretical results with thorough numerical simulations studying the convergence of the Baum-Welch algorithm and illustrating the accuracy of our predictions.
K2-ABC: Approximate Bayesian Computation with Kernel Embeddings
Park, Mijung, Jitkrittum, Wittawat, Sejdinovic, Dino
Complicated generative models often result in a situation where computing the likelihood of observed data is intractable, while simulating from the conditional density given a parameter value is relatively easy. Approximate Bayesian Computation (ABC) is a paradigm that enables simulation-based posterior inference in such cases by measuring the similarity between simulated and observed data in terms of a chosen set of summary statistics. However, there is no general rule to construct sufficient summary statistics for complex models. Insufficient summary statistics will "leak" information, which leads to ABC algorithms yielding samples from an incorrect (partial) posterior. In this paper, we propose a fully nonparametric ABC paradigm which circumvents the need for manually selecting summary statistics. Our approach, K2-ABC, uses maximum mean discrepancy (MMD) to construct a dissimilarity measure between the observed and simulated data. The embedding of an empirical distribution of the data into a reproducing kernel Hilbert space plays the role of the summary statistic and is sufficient whenever the corresponding kernels are characteristic. Experiments on a simulated scenario and a real-world biological problem illustrate the effectiveness of the proposed algorithm. M Park and W Jitkrittum contributed equally.
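The idea of replacing summary statistics with an MMD-based dissimilarity can be sketched as follows. This is a minimal illustration under stated assumptions: a one-dimensional Gaussian kernel with fixed bandwidth, the biased MMD estimate, and soft weights of the form exp(-MMD^2 / epsilon) on prior draws. The function names, the toy Gaussian simulator, and the values of `epsilon` and `bandwidth` are hypothetical choices, not settings from the paper.

```python
import math
import random


def gauss_kernel(x, y, bandwidth):
    return math.exp(-((x - y) ** 2) / (2.0 * bandwidth ** 2))


def mean_kernel(xs, ys, bandwidth):
    total = sum(gauss_kernel(x, y, bandwidth) for x in xs for y in ys)
    return total / (len(xs) * len(ys))


def mmd2(xs, ys, bandwidth=1.0):
    """Biased estimate of the squared MMD between two samples."""
    return (mean_kernel(xs, xs, bandwidth)
            + mean_kernel(ys, ys, bandwidth)
            - 2.0 * mean_kernel(xs, ys, bandwidth))


def k2abc_posterior_mean(observed, sample_prior, simulate, n_draws, epsilon, rng):
    """Weight prior draws by exp(-MMD^2 / epsilon) and average them."""
    thetas, weights = [], []
    for _ in range(n_draws):
        theta = sample_prior(rng)
        pseudo = simulate(theta, len(observed), rng)
        thetas.append(theta)
        weights.append(math.exp(-mmd2(observed, pseudo) / epsilon))
    total = sum(weights)
    weights = [w / total for w in weights]
    estimate = sum(w * t for w, t in zip(weights, thetas))
    return estimate, thetas, weights


if __name__ == "__main__":
    rng = random.Random(0)
    # Toy problem (an assumption): infer the mean of a unit-variance Gaussian.
    observed = [rng.gauss(2.0, 1.0) for _ in range(30)]
    estimate, _, _ = k2abc_posterior_mean(
        observed,
        sample_prior=lambda r: r.uniform(-5.0, 5.0),
        simulate=lambda theta, n, r: [r.gauss(theta, 1.0) for _ in range(n)],
        n_draws=300,
        epsilon=0.05,
        rng=rng,
    )
    print(estimate)
```

Smaller values of `epsilon` concentrate the weights on draws whose simulated data sit closest to the observations, at the cost of a lower effective sample size.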
Using Data Analytics to Detect Anomalous States in Vehicles
Narayanan, Sandeep Nair, Mittal, Sudip, Joshi, Anupam
Vehicles are becoming more and more connected. This opens up a larger attack surface, which affects not only the passengers inside vehicles but also the people around them. These vulnerabilities exist because modern systems are built on the old and comparatively insecure CAN bus framework, which lacks even basic authentication. Since a new protocol can only help future vehicles and not older ones, our approach treats the issue as a data analytics problem and uses machine learning techniques to secure cars. We develop a Hidden Markov Model to detect anomalous states from real data collected from vehicles. Using this model, we are able to detect anomalies and issue alerts while a vehicle is in operation. Our model could be integrated as a plug-and-play device in all new and old cars.
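One common way to use an HMM for anomaly detection is to score a window of observations by its likelihood under the trained model and raise an alert when the score is unusually low. The sketch below follows that generic recipe and is not the procedure of the paper: it assumes a discrete HMM whose parameters are already known, non-empty windows of integer-coded observations, and a hypothetical per-symbol log-likelihood threshold.

```python
import math


def forward_log_likelihood(obs, start, trans, emit):
    """Scaled forward algorithm for a discrete HMM; returns log P(obs)."""
    n_states = len(start)
    alpha = [start[s] * emit[s][obs[0]] for s in range(n_states)]
    log_lik = 0.0
    for t in range(len(obs)):
        if t > 0:
            # Propagate through the transition matrix, then apply the emission.
            alpha = [
                emit[s][obs[t]]
                * sum(alpha[p] * trans[p][s] for p in range(n_states))
                for s in range(n_states)
            ]
        scale = sum(alpha)
        if scale == 0.0:
            return float("-inf")  # the window is impossible under the model
        log_lik += math.log(scale)
        alpha = [a / scale for a in alpha]  # rescale to avoid underflow
    return log_lik


def is_anomalous(window, start, trans, emit, threshold):
    """Flag a window whose per-symbol log-likelihood is below the threshold."""
    score = forward_log_likelihood(window, start, trans, emit) / len(window)
    return score < threshold


if __name__ == "__main__":
    # Hypothetical two-state model over three observation symbols.
    start = [0.5, 0.5]
    trans = [[0.9, 0.1], [0.1, 0.9]]
    emit = [[0.7, 0.29, 0.01], [0.29, 0.7, 0.01]]
    print(is_anomalous([0, 0, 1, 1, 0], start, trans, emit, threshold=-2.0))
    print(is_anomalous([2, 2, 2, 2, 2], start, trans, emit, threshold=-2.0))
```

Dividing by the window length makes scores comparable across windows of different sizes; in practice the threshold would be calibrated on data recorded during normal operation.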
Histogram Meets Topic Model: Density Estimation by Mixture of Histograms
The histogram method is a powerful non-parametric approach for estimating the probability density function of a continuous variable. Compared to parametric approaches, however, constructing a histogram demands a large number of observations to capture the underlying density function. It is therefore not suitable for analyzing a sparse data set, that is, a collection of units each of which has only a small amount of data. In this paper, by employing the probabilistic topic model, we develop a novel Bayesian approach to alleviating the sparsity problem in conventional histogram estimation. Our method estimates a unit's density function as a mixture of basis histograms, in which the number of bins for each basis, as well as their heights, is determined automatically. The estimation procedure uses fast and easy-to-implement collapsed Gibbs sampling. We apply the proposed method to synthetic data, showing that it performs well.
Distinguishing cause from effect using observational data: methods and benchmarks
Mooij, Joris M., Peters, Jonas, Janzing, Dominik, Zscheischler, Jakob, Schölkopf, Bernhard
The discovery of causal relationships from purely observational data is a fundamental problem in science. The most elementary form of such a causal discovery problem is to decide whether X causes Y or, alternatively, Y causes X, given joint observations of two variables X, Y. An example is to decide whether altitude causes temperature, or vice versa, given only joint measurements of both variables. Even under the simplifying assumptions of no confounding, no feedback loops, and no selection bias, such bivariate causal discovery problems are challenging. Nevertheless, several approaches for addressing those problems have been proposed in recent years. We review two families of such methods: Additive Noise Methods (ANM) and Information Geometric Causal Inference (IGCI). We present the benchmark CauseEffectPairs that consists of data for 100 different cause-effect pairs selected from 37 datasets from various domains (e.g., meteorology, biology, medicine, engineering, economy) and motivate our decisions regarding the "ground truth" causal directions of all pairs. We evaluate the performance of several bivariate causal discovery methods on these real-world benchmark data and in addition on artificially simulated data. Our empirical results on real-world data indicate that certain methods are indeed able to distinguish cause from effect using purely observational data, although more benchmark data would be needed to obtain statistically significant conclusions. One of the best performing methods overall is the additive-noise method originally proposed by Hoyer et al. (2009), which obtains an accuracy of 63 ± 10% and an AUC of 0.74 ± 0.05 on the real-world benchmark. As the main theoretical contribution of this work we prove the consistency of that method.