Goto

Collaborating Authors

 Bayesian Inference


Convergence Rates of Latent Topic Models Under Relaxed Identifiability Conditions

arXiv.org Machine Learning

In this paper we study the frequentist convergence rate for the Latent Dirichlet Allocation (Blei et al., 2003) topic models. We show that the maximum likelihood estimator converges to one of the finitely many equivalent parameters in Wasserstein's distance metric at a rate of $n^{-1/4}$ without assuming separability or non-degeneracy of the underlying topics and/or the existence of more than three words per document, thus generalizing the previous works of Anandkumar et al. (2012, 2014) from an information-theoretical perspective. We also show that the $n^{-1/4}$ convergence rate is optimal in the worst case.


Generative Bridging Network in Neural Sequence Prediction

arXiv.org Machine Learning

In order to alleviate data sparsity and over-fitting problems in maximum likelihood estimation (MLE) for sequence prediction tasks, we propose the Generative Bridging Network (GBN), in which a novel bridge module is introduced to assist the training of the sequence prediction model (the generator network). Unlike MLE directly maximizing the conditional likelihood, the bridge extends the point-wise ground truth to a bridge distribution conditioned on it, and the generator is optimized to minimize their KL-divergence. Three different GBNs, namely uniform GBN, language-model GBN and coaching GBN, are proposed to penalize confidence, enhance language smoothness and relieve learning burden. Experiments conducted on two recognized sequence prediction tasks (machine translation and abstractive text summarization) show that our proposed GBNs can yield significant improvements over strong baselines. Furthermore, by analyzing samples drawn from different bridges, expected influences on the generator are verified.


Impacts of Dirty Data: and Experimental Evaluation

arXiv.org Machine Learning

Data quality issues have attracted widespread attention due to the negative impacts of dirty data on data mining and machine learning results. The relationship between data quality and the accuracy of results could be applied on the selection of the appropriate algorithm with the consideration of data quality and the determination of the data share to clean. However, rare research has focused on exploring such relationship. Motivated by this, this paper conducts an experimental comparison for the effects of missing, inconsistent and conflicting data on classification, clustering, and regression algorithms. Based on the experimental findings, we provide guidelines for algorithm selection and data cleaning.


A particle-based variational approach to Bayesian Non-negative Matrix Factorization

arXiv.org Machine Learning

Bayesian Non-negative Matrix Factorization (NMF) is a promising approach for understanding uncertainty and structure in matrix data. However, a large volume of applied work optimizes traditional non-Bayesian NMF objectives that fail to provide a principled understanding of the non-identifiability inherent in NMF-- an issue ideally addressed by a Bayesian approach. Despite their suitability, current Bayesian NMF approaches have failed to gain popularity in an applied setting; they sacrifice flexibility in modeling for tractable computation, tend to get stuck in local modes, and require many thousands of samples for meaningful uncertainty estimates. We address these issues through a particle-based variational approach to Bayesian NMF that only requires the joint likelihood to be differentiable for tractability, uses a novel initialization technique to identify multiple modes in the posterior, and allows domain experts to inspect a `small' set of factorizations that faithfully represent the posterior. We introduce and employ a class of likelihood and prior distributions for NMF that formulate a Bayesian model using popular non-Bayesian NMF objectives. On several real datasets, we obtain better particle approximations to the Bayesian NMF posterior in less time than baselines and demonstrate the significant role that multimodality plays in NMF-related tasks.


Large-Scale Model Selection with Misspecification

arXiv.org Machine Learning

Model selection is crucial to high-dimensional learning and inference for contemporary big data applications in pinpointing the best set of covariates among a sequence of candidate interpretable models. Most existing work assumes implicitly that the models are correctly specified or have fixed dimensionality. Yet both features of model misspecification and high dimensionality are prevalent in practice. In this paper, we exploit the framework of model selection principles in misspecified models originated in Lv and Liu (2014) and investigate the asymptotic expansion of Bayesian principle of model selection in the setting of high-dimensional misspecified models. With a natural choice of prior probabilities that encourages interpretability and incorporates Kullback-Leibler divergence, we suggest the high-dimensional generalized Bayesian information criterion with prior probability (HGBIC_p) for large-scale model selection with misspecification. Our new information criterion characterizes the impacts of both model misspecification and high dimensionality on model selection. We further establish the consistency of covariance contrast matrix estimation and the model selection consistency of HGBIC_p in ultra-high dimensions under some mild regularity conditions. The advantages of our new method are supported by numerical studies.


Optimal Bipartite Network Clustering

arXiv.org Machine Learning

We consider the problem of bipartite community detection in networks, or more generally the network biclustering problem. We present a fast two-stage procedure based on spectral initialization followed by the application of a pseudo-likelihood classifier twice. Under mild regularity conditions, we establish the weak consistency of the procedure (i.e., the convergence of the misclassification rate to zero) under a general bipartite stochastic block model. We show that the procedure is optimal in the sense that it achieves the optimal convergence rate that is achievable by a biclustering oracle, adaptively over the whole class, up to constants. The optimal rate we obtain sharpens some of the existing results and generalizes others to a wide regime of average degree growth. As a special case, we recover the known exact recovery threshold in the $\log n$ regime of sparsity. To obtain the general consistency result, as part of the provable version of the algorithm, we introduce a sub-block partitioning scheme that is also computationally attractive, allowing for distributed implementation of the algorithm without sacrificing optimality. The provable version of the algorithm is derived from a general blueprint for pseudo-likelihood biclustering algorithms that employ simple EM type updates. We show the effectiveness of this general class by numerical simulations.


Capturing Structure Implicitly from Time-Series having Limited Data

arXiv.org Machine Learning

Scientific fields such as insider-threat detection and highway-safety planning often lack sufficient amounts of time-series data to estimate statistical models for the purpose of scientific discovery. Moreover, the available limited data are quite noisy. This presents a major challenge when estimating time-series models that are robust to overfitting and have well-calibrated uncertainty estimates. Most of the current literature in these fields involve visualizing the time-series for noticeable structure and hard coding them into pre-specified parametric functions. This approach is associated with two limitations. First, given that such trends may not be easily noticeable in small data, it is difficult to explicitly incorporate expressive structure into the models during formulation. Second, it is difficult to know $\textit{a priori}$ the most appropriate functional form to use. To address these limitations, a nonparametric Bayesian approach was proposed to implicitly capture hidden structure from time series having limited data. The proposed model, a Gaussian process with a spectral mixture kernel, precludes the need to pre-specify a functional form and hard code trends, is robust to overfitting and has well-calibrated uncertainty estimates.


Development and analysis of a Bayesian water balance model for large lake systems

arXiv.org Machine Learning

Water balance models (WBMs) are often employed to understand regional hydrologic cycles over various time scales. Most WBMs, however, are physically-based, and few employ state-of-the-art statistical methods to reconcile independent input measurement uncertainty and bias. Further, few WBMs exist for large lakes, and most large lake WBMs perform additive accounting, with minimal consideration towards input data uncertainty. Here, we introduce a framework for improving a previously developed large lake statistical water balance model (L2SWBM). Focusing on the water balances of Lakes Superior and Michigan-Huron, we demonstrate our new analytical framework, identifying L2SWBMs from 26 alternatives that adequately close the water balance of the lakes with satisfactory computation times compared with the prototype model. We expect our new framework will be used to develop water balance models for other lakes around the world.


Minimal I-MAP MCMC for Scalable Structure Discovery in Causal DAG Models

arXiv.org Machine Learning

Learning a Bayesian network (BN) from data can be useful for decision-making or discovering causal relationships. However, traditional methods often fail in modern applications, which exhibit a larger number of observed variables than data points. The resulting uncertainty about the underlying network as well as the desire to incorporate prior information recommend a Bayesian approach to learning the BN, but the highly combinatorial structure of BNs poses a striking challenge for inference. The current state-of-the-art methods such as order MCMC are faster than previous methods but prevent the use of many natural structural priors and still have running time exponential in the maximum indegree of the true directed acyclic graph (DAG) of the BN. We here propose an alternative posterior approximation based on the observation that, if we incorporate empirical conditional independence tests, we can focus on a high-probability DAG associated with each order of the vertices. We show that our method allows the desired flexibility in prior specification, removes timing dependence on the maximum indegree and yields provably good posterior approximations; in addition, we show that it achieves superior accuracy, scalability, and sampler mixing on several datasets.


Bayesian Incremental Learning for Deep Neural Networks

arXiv.org Machine Learning

In industrial machine learning pipelines, data often arrive in parts. Particularly in the case of deep neural networks, it may be too expensive to train the model from scratch each time, so one would rather use a previously learned model and the new data to improve performance. However, deep neural networks are prone to getting stuck in a suboptimal solution when trained on only new data as compared to the full dataset. Our work focuses on a continuous learning setup where the task is always the same and new parts of data arrive sequentially. We apply a Bayesian approach to update the posterior approximation with each new piece of data and find this method to outperform the traditional approach in our experiments.