Learning Graphical Models
Accurate and Conservative Estimates of MRF Log-likelihood using Reverse Annealing
Burda, Yuri, Grosse, Roger B., Salakhutdinov, Ruslan
Markov random fields (MRFs) are difficult to evaluate as generative models because computing the test log-probabilities requires the intractable partition function. Annealed importance sampling (AIS) is widely used to estimate MRF partition functions, and often yields quite accurate results. However, AIS is prone to overestimate the log-likelihood with little indication that anything is wrong. We present the Reverse AIS Estimator (RAISE), a stochastic lower bound on the log-likelihood of an approximation to the original MRF model. RAISE requires only the same MCMC transition operators as standard AIS. Experimental results indicate that RAISE agrees closely with AIS log-probability estimates for RBMs, DBMs, and DBNs, but typically errs on the side of underestimating, rather than overestimating, the log-likelihood.
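The tendency to overestimate follows from Jensen's inequality. The short derivation below is a sketch in notation that is assumed here rather than taken from the abstract: $f(x)$ is the unnormalized MRF density, $Z$ the true partition function, and $\hat{Z}$ the AIS estimate, which is unbiased for $Z$.

```latex
\begin{align*}
  \mathbb{E}[\hat{Z}] &= Z
    && \text{(AIS is unbiased for } Z\text{)} \\
  \mathbb{E}[\log \hat{Z}] &\le \log \mathbb{E}[\hat{Z}] = \log Z
    && \text{(Jensen's inequality, } \log \text{ is concave)} \\
  \mathbb{E}[\log \hat{p}(x)] &= \log f(x) - \mathbb{E}[\log \hat{Z}]
    \;\ge\; \log f(x) - \log Z = \log p(x)
\end{align*}
```

In expectation, then, plugging the AIS estimate into the log-likelihood gives a value at or above the true one, which is the opposite direction from the conservative bound that RAISE targets.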
Understanding and Designing Complex Systems: Response to "A framework for optimal high-level descriptions in science and engineering---preliminary report"
Crutchfield, James P., James, Ryan G., Marzen, Sarah, Varn, Dowman P.
Building compact models of nonlinear processes goes to the heart of our understanding of the complex world around us--a world replete with unanticipated, emergent patterns. Via discovery mechanisms that we do not yet understand well, we eventually do come to know many of these patterns, even if we have never seen them before. Such discoveries can be substantial. At a minimum, compact models that capture such emergent "macrostates" are essential tools in harnessing complex processes to useful ends. Most ambitiously, one would hope to automate the discovery process itself, providing an especially useful tool for the era of Big Data. One key problem in the larger endeavor of pattern discovery is dimension reduction: reduce the high-dimensional state space of a stochastic dynamical system into smaller, more manageable models that nonetheless still capture the relevant dynamics. The study of complex systems always requires this.
Tutorial on Structured Continuous-Time Markov Processes
A continuous-time Markov process (CTMP) is a collection of variables indexed by a continuous quantity, time. It obeys the Markov property that the distribution over a future variable is independent of past variables given the state at the present time. We introduce continuous-time Markov process representations and algorithms for filtering, smoothing, expected sufficient statistics calculations, and model estimation, assuming no prior knowledge of continuous-time processes but some basic knowledge of probability and statistics. We begin by describing "flat" or unstructured Markov processes and then move to structured Markov processes (those arising from state spaces consisting of assignments to variables) including Kronecker, decision-diagram, and continuous-time Bayesian network representations. We provide the first connection between decision-diagrams and continuous-time Bayesian networks.
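For the "flat" case, a minimal sketch of how transition probabilities follow from a rate matrix may help. It assumes the standard relation $P(t) = \exp(Qt)$ for a generator matrix $Q$. The helper names (`mat_mul`, `mat_exp`, `transition_matrix`) and the two-state rates `a` and `b` are illustrative choices, not taken from the tutorial.

```python
import math

def mat_mul(A, B):
    """Multiply two square matrices given as lists of rows."""
    n = len(A)
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def mat_exp(M, squarings=10, terms=20):
    """Approximate exp(M) by scaling and squaring a truncated Taylor series."""
    n = len(M)
    scale = 2.0 ** squarings
    S = [[M[i][j] / scale for j in range(n)] for i in range(n)]
    result = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    term = [row[:] for row in result]
    for k in range(1, terms + 1):
        # term holds S^k / k! after this update
        term = [[x / k for x in row] for row in mat_mul(term, S)]
        result = [[result[i][j] + term[i][j] for j in range(n)]
                  for i in range(n)]
    for _ in range(squarings):
        # undo the scaling: exp(M) = (exp(M / 2^s))^(2^s)
        result = mat_mul(result, result)
    return result

def transition_matrix(Q, t):
    """P(t)[i][j] = probability of being in state j at time t, starting in i."""
    n = len(Q)
    return mat_exp([[Q[i][j] * t for j in range(n)] for i in range(n)])

# Hypothetical two-state process: rate a for 0 -> 1, rate b for 1 -> 0.
a, b = 2.0, 1.0
Q = [[-a, a], [b, -b]]  # each row of a rate matrix sums to zero
P = transition_matrix(Q, 0.5)
```

For this two-state chain the result can be checked against the closed form $P_{00}(t) = \frac{b}{a+b} + \frac{a}{a+b} e^{-(a+b)t}$. Structured representations matter precisely because a dense exponential like this one becomes infeasible as the state space grows.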
Model Selection in High-Dimensional Misspecified Models
Basu, Pallavi, Feng, Yang, Lv, Jinchi
Model selection is indispensable to high-dimensional sparse modeling in selecting the best set of covariates among a sequence of candidate models. Most existing work assumes implicitly that the model is correctly specified or of fixed dimensions. Yet model misspecification and high dimensionality are common in real applications. In this paper, we investigate two classical Kullback-Leibler divergence and Bayesian principles of model selection in the setting of high-dimensional misspecified models. Asymptotic expansions of these principles reveal that the effect of model misspecification is crucial and should be taken into account, leading to the generalized AIC and generalized BIC in high dimensions. With a natural choice of prior probabilities, we suggest the generalized BIC with prior probability which involves a logarithmic factor of the dimensionality in penalizing model complexity. We further establish the consistency of the covariance contrast matrix estimator in a general setting. Our results and new method are supported by numerical studies.
Locally Weighted Learning for Naive Bayes Classifier
As a consequence of the strong and usually violated conditional independence assumption (CIA) of the naive Bayes (NB) classifier, the performance of NB becomes less and less favorable compared to sophisticated classifiers as the sample size increases. We learn from this phenomenon that when the training data set is large, we should either relax the assumption or apply NB to a "reduced" data set, for example by using NB as a local model. The latter approach trades the ignored information for robustness to the model assumption. In this paper, we consider using NB as a model for locally weighted data. A special weighting function is designed so that if CIA holds for the unweighted data, it also holds for the weighted data. The new method is intuitive and capable of handling class imbalance. It is theoretically more sound than the locally weighted learners of naive Bayes that base classification only on the $k$ nearest neighbors. An empirical study shows that the new method, with an appropriate choice of parameter, outperforms seven existing classifiers of similar nature.
Parameter estimation in spherical symmetry groups
Chen, Yu-Hui, Wei, Dennis, Newstadt, Gregory, DeGraef, Marc, Simmons, Jeffrey, Hero, Alfred
This paper considers statistical estimation problems where the probability distribution of the observed random variable is invariant with respect to actions of a finite topological group. It is shown that any such distribution must satisfy a restricted finite mixture representation. When specialized to the case of distributions over the sphere that are invariant to the actions of a finite spherical symmetry group $\mathcal G$, a group-invariant extension of the von Mises-Fisher (VMF) distribution is obtained. The $\mathcal G$-invariant VMF is parameterized by location and scale parameters that specify the distribution's mean orientation and its concentration about the mean, respectively. Using the restricted finite mixture representation, these parameters can be estimated using an Expectation Maximization (EM) maximum likelihood (ML) estimation algorithm. This is illustrated for the problem of mean crystal orientation estimation under the spherically symmetric group associated with the crystal form, e.g., cubic or octahedral or hexahedral. Simulations and experiments establish the advantages of the extended VMF EM-ML estimator for data acquired by Electron Backscatter Diffraction (EBSD) microscopy of a polycrystalline Nickel alloy sample.
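To make the mixture idea concrete, here is a minimal sketch of a group-invariant VMF density on the unit sphere in three dimensions. It assumes the invariant density is the uniform average of ordinary VMF densities whose mean directions are the group images of a single mean. That form is one reading of the "restricted finite mixture representation", not a formula quoted from the paper. The small group used (rotations about the z-axis by multiples of 90 degrees) and all function names are illustrative only.

```python
import math

def matvec(R, v):
    """Apply a 3x3 matrix (list of rows) to a 3-vector."""
    return [sum(R[i][j] * v[j] for j in range(3)) for i in range(3)]

def vmf_density(x, mu, kappa):
    """von Mises-Fisher density on the unit sphere in R^3 (kappa > 0)."""
    norm = kappa / (4.0 * math.pi * math.sinh(kappa))
    return norm * math.exp(kappa * sum(m * xi for m, xi in zip(mu, x)))

def invariant_vmf_density(x, mu, kappa, group):
    """Assumed form: equal-weight mixture over the group orbit of mu."""
    return sum(vmf_density(x, matvec(g, mu), kappa) for g in group) / len(group)

# Illustrative group: rotations about the z-axis by 0, 90, 180, 270 degrees.
E = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]
R = [[0, -1, 0], [1, 0, 0], [0, 0, 1]]
R2 = [[-1, 0, 0], [0, -1, 0], [0, 0, 1]]
R3 = [[0, 1, 0], [-1, 0, 0], [0, 0, 1]]
GROUP = [E, R, R2, R3]

mu = [0.6, 0.0, 0.8]     # unit-length mean direction
x = [0.48, 0.36, 0.8]    # unit-length query point
kappa = 5.0

density_at_x = invariant_vmf_density(x, mu, kappa, GROUP)
density_at_rotated_x = invariant_vmf_density(matvec(R, x), mu, kappa, GROUP)
```

The two values agree up to rounding, because rotating the query point by a group element only permutes the mixture components. With equal weights fixed, only the location `mu` and the concentration `kappa` remain free, matching the two parameters the abstract says EM estimates.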
Cauchy Principal Component Analysis
Principal Component Analysis (PCA) has wide applications in machine learning, text mining and computer vision. Classical PCA based on a Gaussian noise model is fragile to noise of large magnitude. PCA methods based on a Laplace noise assumption cannot deal with dense noise effectively. In this paper, we propose Cauchy Principal Component Analysis (Cauchy PCA), a very simple yet effective PCA method which is robust to various types of noise. We use the Cauchy distribution to model noise and derive Cauchy PCA under the maximum likelihood estimation (MLE) framework with a low rank constraint. Our method can robustly estimate the low rank matrix regardless of whether noise is large or small, dense or sparse. We analyze the robustness of Cauchy PCA from a robust statistics view and present an efficient singular value projection optimization method. Experimental results on both simulated data and real applications demonstrate the robustness of Cauchy PCA to various noise patterns.
From dependency to causality: a machine learning approach
Bontempi, Gianluca, Flauder, Maxime
The relationship between statistical dependency and causality lies at the heart of all statistical approaches to causal inference and can be summarized by two famous statements: correlation (or more generally statistical association) does not imply causation, and causation induces a statistical dependency between causes and effects (or more generally descendants) ([26]). In other words, it is well known that statistical dependency is a necessary yet not sufficient condition for causality. The unidirectional link between these two notions has been used by many formal approaches to causality to justify the adoption of statistical methods for detecting or inferring causal links from observational data. The most influential one is the Causal Bayesian Network approach, detailed in ([17]), which relies on notions of independence and conditional independence to detect causal patterns in the data. Well known examples of related inference algorithms are the constraint-based methods like the PC algorithms ([30]) and IC ([23]). These approaches are founded on probability theory and have been shown to be accurate in reconstructing causal patterns in many applications.
Surpassing Human-Level Face Verification Performance on LFW with GaussianFace
Face verification remains a challenging problem in very complex conditions with large variations such as pose, illumination, expression, and occlusions. This problem is exacerbated when we rely unrealistically on a single training data source, which is often insufficient to cover the intrinsically complex face variations. This paper proposes a principled multi-task learning approach based on the Discriminative Gaussian Process Latent Variable Model, named GaussianFace, to enrich the diversity of training data. In comparison to existing methods, our model exploits additional data from multiple source-domains to improve the generalization performance of face verification in an unknown target-domain. Importantly, our model can adapt automatically to complex data distributions, and can therefore capture the complex face variations inherent in multiple sources. Extensive experiments demonstrate the effectiveness of the proposed model in learning from diverse data sources and generalizing to an unseen domain. Specifically, our algorithm achieves an impressive accuracy of 98.52% on the well-known and challenging Labeled Faces in the Wild (LFW) benchmark [23]. For the first time, the human-level performance in face verification (97.53%) [28] on LFW is surpassed.
Tag-Aware Ordinal Sparse Factor Analysis for Learning and Content Analytics
Lan, Andrew S., Studer, Christoph, Waters, Andrew E., Baraniuk, Richard G.
Machine learning offers novel ways and means to design personalized learning systems wherein each student's educational experience is customized in real time depending on their background, learning goals, and performance to date. SPARse Factor Analysis (SPARFA) is a novel framework for machine learning-based learning analytics, which estimates a learner's knowledge of the concepts underlying a domain, and content analytics, which estimates the relationships among a collection of questions and those concepts. SPARFA jointly learns the associations among the questions and the concepts, learner concept knowledge profiles, and the underlying question difficulties, solely based on the correct/incorrect graded responses of a population of learners to a collection of questions. In this paper, we extend the SPARFA framework significantly to enable: (i) the analysis of graded responses on an ordinal scale (partial credit) rather than a binary scale (correct/incorrect); (ii) the exploitation of tags/labels for questions that partially describe the question-concept associations. The resulting Ordinal SPARFA-Tag framework greatly enhances the interpretability of the estimated concepts. We demonstrate using real educational data that Ordinal SPARFA-Tag outperforms both SPARFA and existing collaborative filtering techniques in predicting missing learner responses.