Uncertainty
Rebuilding Factorized Information Criterion: Asymptotically Accurate Marginal Likelihood
Hayashi, Kohei, Maeda, Shin-ichi, Fujimaki, Ryohei
The marginal log-likelihood is a key concept of Bayesian model identification of latent variable models (LVMs), such as mixture models (MMs), probabilistic principal component analysis, and hidden Markov models (HMMs). Determination of dimensionality of latent variables is an essential task to uncover hidden structures behind the observed data as well as to mitigate overfitting. In general, LVMs are singular (i.e., mapping between parameters and probabilistic models is not one-to-one) and such classical information criteria based on the regularity assumption as the Bayesian information criterion (BIC) [Schwarz, 1978] are no longer justified. Since exact evaluation of 1 the marginal log-likelihood is often not available, approximation techniques have been developed using sampling (i.e., Markov Chain Monte Carlo methods (MCMCs) [Hastings, 1970]), a variational lower bound (i.e., the variational Bayes methods (VB) [Attias, 1999, Jordan et al., 1999]), or algebraic geometry (i.e., the widely applicable BIC (WBIC) [Watanabe, 2013]). However, model selection using these methods typically requires heavy computational cost (e.g., a large number of MCMC sampling in a high-dimensional space, an outer loop for VB/WBIC.) In the last few years, a new approximation technique and an inference method, factorized information criterion (FIC) and factorized asymptotic Bayesian inference (FAB), have been developed for some binary LVMs [Fujimaki and Morinaga, 2012, Fujimaki and Hayashi, 2012, Hayashi and Fujimaki, 2013, Eto et al., 2014]. Unlike existing methods which evaluate approximated marginal log-likelihoods calculated for each latent variable dimensionality (and therefore need an outer loop for model selection), FAB finds an effective dimensionality via an EMstyle alternating optimization procedure.
Semantic Enrichment of Mobile Phone Data Records Using Background Knowledge
Dashdorj, Zolzaya, Sobolevsky, Stanislav, Serafini, Luciano, Antonelli, Fabrizio, Ratti, Carlo
Every day, billions of mobile network events (i.e. CDRs) are generated by cellular phone operator companies. Latent in this data are inspiring insights about human actions and behaviors, the discovery of which is important because context-aware applications and services hold the key to user-driven, intelligent services, which can enhance our everyday lives such as social and economic development, urban planning, and health prevention. The major challenge in this area is that interpreting such a big stream of data requires a deep understanding of mobile network events' context through available background knowledge. This article addresses the issues in context awareness given heterogeneous and uncertain data of mobile network events missing reliable information on the context of this activity. The contribution of this research is a model from a combination of logical and statistical reasoning standpoints for enabling human activity inference in qualitative terms from open geographical data that aimed at improving the quality of human behaviors recognition tasks from CDRs. We use open geographical data, Openstreetmap (OSM), as a proxy for predicting the content of human activity in the area. The user study performed in Trento shows that predicted human activities (top level) match the survey data with around 93% overall accuracy. The extensive validation for predicting a more specific economic type of human activity performed in Barcelona, by employing credit card transaction data. The analysis identifies that appropriately normalized data on points of interest (POI) is a good proxy for predicting human economical activities, with 84% accuracy on average. So the model is proven to be efficient for predicting the context of human activity, when its total level could be efficiently observed from cell phone data records, missing contextual information however.
A local approach to estimation in discrete loglinear models
We consider two connected aspects of maximum likelihood estimation of the parameter for high-dimensional discrete graphical models: the existence of the maximum likelihood estimate (mle) and its computation. When the data is sparse, there are many zeros in the contingency table and the maximum likelihood estimate of the parameter may not exist. Fienberg and Rinaldo (2012) have shown that the mle does not exists iff the data vector belongs to a face of the so-called marginal cone spanned by the rows of the design matrix of the model. Identifying these faces in high-dimension is challenging. In this paper, we take a local approach : we show that one such face, albeit possibly not the smallest one, can be identified by looking at a collection of marginal graphical models generated by induced subgraphs $G_i,i=1,\ldots,k$ of $G$. This is our first contribution. Our second contribution concerns the composite maximum likelihood estimate. When the dimension of the problem is large, estimating the parameters of a given graphical model through maximum likelihood is challenging, if not impossible. The traditional approach to this problem has been local with the use of composite likelihood based on local conditional likelihoods. A more recent development is to have the components of the composite likelihood be marginal likelihoods centred around each $v$. We first show that the estimates obtained by consensus through local conditional and marginal likelihoods are identical. We then study the asymptotic properties of the composite maximum likelihood estimate when both the dimension of the model and the sample size $N$ go to infinity.
Streaming Variational Inference for Bayesian Nonparametric Mixture Models
Tank, Alex, Foti, Nicholas J., Fox, Emily B.
In theory, Bayesian nonparametric (BNP) models are well suited to streaming data scenarios due to their ability to adapt model complexity with the observed data. Unfortunately, such benefits have not been fully realized in practice; existing inference algorithms are either not applicable to streaming applications or not extensible to BNP models. For the special case of Dirichlet processes, streaming inference has been considered. However, there is growing interest in more flexible BNP models building on the class of normalized random measures (NRMs). We work within this general framework and present a streaming variational inference algorithm for NRM mixture models. Our algorithm is based on assumed density filtering (ADF), leading straightforwardly to expectation propagation (EP) for large-scale batch inference as well. We demonstrate the efficacy of the algorithm on clustering documents in large, streaming text corpora.
Asymptotic Accuracy of Bayesian Estimation for a Single Latent Variable
In data science and machine learning, hierarchical parametric models, such as mixture models, are often used. They contain two kinds of variables: observable variables, which represent the parts of the data that can be directly measured, and latent variables, which represent the underlying processes that generate the data. Although there has been an increase in research on the estimation accuracy for observable variables, the theoretical analysis of estimating latent variables has not been thoroughly investigated. In a previous study, we determined the accuracy of a Bayes estimation for the joint probability of the latent variables in a dataset, and we proved that the Bayes method is asymptotically more accurate than the maximum-likelihood method. However, the accuracy of the Bayes estimation for a single latent variable remains unknown. In the present paper, we derive the asymptotic expansions of the error functions, which are defined by the Kullback-Leibler divergence, for two types of single-variable estimations when the statistical regularity is satisfied. Our results indicate that the accuracies of the Bayes and maximum-likelihood methods are asymptotically equivalent and clarify that the Bayes method is only advantageous for multivariable estimations.
Topic Modeling of Hierarchical Corpora
Kim, Do-kyum, Voelker, Geoffrey M., Saul, Lawrence K.
We study the problem of topic modeling in corpora whose documents are organized in a multi-level hierarchy. We explore a parametric approach to this problem, assuming that the number of topics is known or can be estimated by cross-validation. The models we consider can be viewed as special (finite-dimensional) instances of hierarchical Dirichlet processes (HDPs). For these models we show that there exists a simple variational approximation for probabilistic inference. The approximation relies on a previously unexploited inequality that handles the conditional dependence between Dirichlet latent variables in adjacent levels of the model's hierarchy. We compare our approach to existing implementations of nonparametric HDPs. On several benchmarks we find that our approach is faster than Gibbs sampling and able to learn more predictive models than existing variational methods. Finally, we demonstrate the large-scale viability of our approach on two newly available corpora from researchers in computer security---one with 350,000 documents and over 6,000 internal subcategories, the other with a five-level deep hierarchy.
Geometric Representations of Random Hypergraphs
Lunagรณmez, Simรณn, Mukherjee, Sayan, Wolpert, Robert L., Airoldi, Edoardo M.
A parametrization of hypergraphs based on the geometry of points in $\mathbf{R}^d$ is developed. Informative prior distributions on hypergraphs are induced through this parametrization by priors on point configurations via spatial processes. This prior specification is used to infer conditional independence models or Markov structure of multivariate distributions. Specifically, we can recover both the junction tree factorization as well as the hyper Markov law. This approach offers greater control on the distribution of graph features than Erd\"os-R\'enyi random graphs, supports inference of factorizations that cannot be retrieved by a graph alone, and leads to new Metropolis\slash Hastings Markov chain Monte Carlo algorithms with both local and global moves in graph space. We illustrate the utility of this parametrization and prior specification using simulations.
Knowledge reduction of dynamic covering decision information systems with varying attribute values
Knowledge reduction of dynamic covering information systems involves with the time in practical situations. In this paper, we provide incremental approaches to computing the type-1 and type-2 characteristic matrices of dynamic coverings because of varying attribute values. Then we present incremental algorithms of constructing the second and sixth approximations of sets by using characteristic matrices. We employ experimental results to illustrate that the incremental approaches are effective to calculate approximations of sets in dynamic covering information systems. Finally, we perform knowledge reduction of dynamic covering information systems with the incremental approaches.
Privacy for Free: Posterior Sampling and Stochastic Gradient Monte Carlo
Wang, Yu-Xiang, Fienberg, Stephen E., Smola, Alex
Bayesian models have proven to be one of the most successful classes of tools in machine learning. It stands out as a principled yet conceptually simple pipeline for combining expert knowledge and statistical evidence, modelling with complicated dependency structures and harnessing uncertainty by making probabilistic inferences (Geman & Geman, 1984; Gelman et al., 2014). In the past few decades, the Bayesian approach has been intensively used in modelling speeches (Rabiner, 1989), text documents (Blei et al., 2003), images/videos (Fei-Fei & Perona, 2005), social networks (Airoldi et al., 2009), brain activity (Penny et al., 2011), and is often considered gold standard in many of these application domains. Learning a Bayesisan model typically involves sampling from a posterior distribution, therefore the learning process is inherently randomized. Differential privacy (DP) is a cryptography-inspired notion of privacy (Dwork, 2006; Dwork et al., 2006). It is designed to provide a very strong form of protection of individual user's private information and at the same time allow data analyses to be conducted with proper utility. Any algorithm that preserves differential privacy must be appropriately randomized too. For instance, one can differential-privately release the average salary of Californian males by adding a Laplace noise proportional to the sensitivity of this figure upon small perturbation of the data sample. In this paper, we connect the two seemingly unrelated concepts by showing that under standard assumptions, the intrinsic randomization in the Bayesian learning can be exploited to obtain a degree of differential privacy.
Early Stopping is Nonparametric Variational Inference
Maclaurin, Dougal, Duvenaud, David, Adams, Ryan P.
We show that unconverged stochastic gradient descent can be interpreted as a procedure that samples from a nonparametric variational approximate posterior distribution. This distribution is implicitly defined as the transformation of an initial distribution by a sequence of optimization updates. By tracking the change in entropy over this sequence of transformations during optimization, we form a scalable, unbiased estimate of the variational lower bound on the log marginal likelihood. We can use this bound to optimize hyperparameters instead of using cross-validation. This Bayesian interpretation of SGD suggests improved, overfitting-resistant optimization procedures, and gives a theoretical foundation for popular tricks such as early stopping and ensembling. We investigate the properties of this marginal likelihood estimator on neural network models.