Learning Graphical Models
Expectation Propagation
Raymond, Jack, Manoel, Andre, Opper, Manfred
Variational inference is a powerful concept that underlies many iterative approximation algorithms; expectation propagation, mean-field methods and belief propagations were all central themes at the school that can be perceived from this unifying framework. The lectures of Manfred Opper introduce the archetypal example of Expectation Propagation, before establishing the connection with the other approximation methods. Corrections by expansion about the expectation propagation are then explained. Finally some advanced inference topics and applications are explored in the final sections.
On the Maximum Entropy Property of the First-Order Stable Spline Kernel and its Implications
A new nonparametric approach for system identification has been recently proposed where the impulse response is seen as the realization of a zero--mean Gaussian process whose covariance, the so--called stable spline kernel, guarantees that the impulse response is almost surely stable. Maximum entropy properties of the stable spline kernel have been pointed out in the literature. In this paper we provide an independent proof that relies on the theory of matrix extension problems in the graphical model literature and leads to a closed form expression for the inverse of the first order stable spline kernel as well as to a new factorization in the form $UWU^\top$ with $U$ upper triangular and $W$ diagonal. Interestingly, all first--order stable spline kernels share the same factor $U$ and $W$ admits a closed form representation in terms of the kernel hyperparameter, making the factorization computationally inexpensive. Maximum likelihood properties of the stable spline kernel are also highlighted. These results can be applied both to improve the stability and to reduce the computational complexity associated with the computation of stable spline estimators.
Tight Error Bounds for Structured Prediction
Globerson, Amir, Roughgarden, Tim, Sontag, David, Yildirim, Cafer
Structured prediction tasks in machine learning involve the simultaneous prediction of multiple labels. This is typically done by maximizing a score function on the space of labels, which decomposes as a sum of pairwise elements, each depending on two specific labels. Intuitively, the more pairwise terms are used, the better the expected accuracy. However, there is currently no theoretical account of this intuition. This paper takes a significant step in this direction. We formulate the problem as classifying the vertices of a known graph $G=(V,E)$, where the vertices and edges of the graph are labelled and correlate semi-randomly with the ground truth. We show that the prospects for achieving low expected Hamming error depend on the structure of the graph $G$ in interesting ways. For example, if $G$ is a very poor expander, like a path, then large expected Hamming error is inevitable. Our main positive result shows that, for a wide class of graphs including 2D grid graphs common in machine vision applications, there is a polynomial-time algorithm with small and information-theoretically near-optimal expected error. Our results provide a first step toward a theoretical justification for the empirical success of the efficient approximate inference algorithms that are used for structured prediction in models where exact inference is intractable.
SAME but Different: Fast and High-Quality Gibbs Parameter Estimation
Zhao, Huasha, Jiang, Biye, Canny, John
Gibbs sampling is a workhorse for Bayesian inference but has several limitations when used for parameter estimation, and is often much slower than non-sampling inference methods. SAME (State Augmentation for Marginal Estimation) \cite{Doucet99,Doucet02} is an approach to MAP parameter estimation which gives improved parameter estimates over direct Gibbs sampling. SAME can be viewed as cooling the posterior parameter distribution and allows annealed search for the MAP parameters, often yielding very high quality (lower loss) estimates. But it does so at the expense of additional samples per iteration and generally slower performance. On the other hand, SAME dramatically increases the parallelism in the sampling schedule, and is an excellent match for modern (SIMD) hardware. In this paper we explore the application of SAME to graphical model inference on modern hardware. We show that combining SAME with factored sample representation (or approximation) gives throughput competitive with the fastest symbolic methods, but with potentially better quality. We describe experiments on Latent Dirichlet Allocation, achieving speeds similar to the fastest reported methods (online Variational Bayes) and lower cross-validated loss than other LDA implementations. The method is simple to implement and should be applicable to many other models.
Particle Metropolis-Hastings using gradient and Hessian information
Dahlin, Johan, Lindsten, Fredrik, Schön, Thomas B.
Particle Metropolis-Hastings (PMH) allows for Bayesian parameter inference in nonlinear state space models by combining Markov chain Monte Carlo (MCMC) and particle filtering. The latter is used to estimate the intractable likelihood. In its original formulation, PMH makes use of a marginal MCMC proposal for the parameters, typically a Gaussian random walk. However, this can lead to a poor exploration of the parameter space and an inefficient use of the generated particles. We propose a number of alternative versions of PMH that incorporate gradient and Hessian information about the posterior into the proposal. This information is more or less obtained as a byproduct of the likelihood estimation. Indeed, we show how to estimate the required information using a fixed-lag particle smoother, with a computational cost growing linearly in the number of particles. We conclude that the proposed methods can: (i) decrease the length of the burn-in phase, (ii) increase the mixing of the Markov chain at the stationary phase, and (iii) make the proposal distribution scale invariant which simplifies tuning.
Model-based Kernel Sum Rule
Nishiyama, Yu, Kanagawa, Motonobu, Gretton, Arthur, Fukumizu, Kenji
In this study, we enrich the framework of nonparametric kernel Bayesian inference via the flexible incorporation of certain probabilistic models, such as additive Gaussian noise models. Nonparametric inference expressed in terms of kernel means, which is called kernel Bayesian inference, has been studied using basic rules such as the kernel sum rule (KSR), kernel chain rule, kernel product rule, and kernel Bayes' rule (KBR). However, the current framework used for kernel Bayesian inference deals only with nonparametric inference and it cannot allow inference when combined with probabilistic models. In this study, we introduce a novel KSR, called model-based KSR (Mb-KSR), which exploits the knowledge obtained from some probabilistic models of conditional distributions. The incorporation of Mb-KSR into nonparametric kernel Bayesian inference facilitates more flexible kernel Bayesian inference than nonparametric inference. We focus on combinations of Mb-KSR, Non-KSR, and KBR, and we propose a filtering algorithm for state space models, which combines nonparametric learning of the observation process using kernel means and additive Gaussian noise models of the transition dynamics. The idea of the Mb-KSR for additive Gaussian noise models can be extended to more general noise model cases, including a conjugate pair with a positive-definite kernel and a probabilistic model.
Statistical inference with probabilistic graphical models
Drémeau, Angélique, Schülke, Christophe, Xu, Yingying, Shah, Devavrat
These are notes from the lecture of Devavrat Shah given at the autumn school "Statistical Physics, Optimization, Inference, and Message-Passing Algorithms", that took place in Les Houches, France from Monday September 30th, 2013, till Friday October 11th, 2013. The school was organized by Florent Krzakala from UPMC & ENS Paris, Federico Ricci-Tersenghi from La Sapienza Roma, Lenka Zdeborova from CEA Saclay & CNRS, and Riccardo Zecchina from Politecnico Torino. This lecture of Devavrat Shah (MIT) covers the basics of inference and learning. It explains how inference problems are represented within structures known as graphical models. The theoretical basis of the belief propagation algorithm is then explained and derived. This lecture sets the stage for generalizations and applications of message passing algorithms.
Hardness of parameter estimation in graphical models
Bresler, Guy, Gamarnik, David, Shah, Devavrat
We consider the problem of learning the canonical parameters specifying an undirected graphical model (Markov random field) from the mean parameters. For graphical models representing a minimal exponential family, the canonical parameters are uniquely determined by the mean parameters, so the problem is feasible in principle. The goal of this paper is to investigate the computational feasibility of this statistical task. Our main result shows that parameter estimation is in general intractable: no algorithm can learn the canonical parameters of a generic pair-wise binary graphical model from the mean parameters in time bounded by a polynomial in the number of variables (unless RP = NP). Indeed, such a result has been believed to be true (see the monograph by Wainwright and Jordan (2008)) but no proof was known. Our proof gives a polynomial time reduction from approximating the partition function of the hard-core model, known to be hard, to learning approximate parameters. Our reduction entails showing that the marginal polytope boundary has an inherent repulsive property, which validates an optimization procedure over the polytope that does not use any knowledge of its structure (as required by the ellipsoid method and others).
Collapsed Variational Bayes Inference of Infinite Relational Model
Ishiguro, Katsuhiko, Sato, Issei, Ueda, Naonori
The Infinite Relational Model (IRM) is a probabilistic model for relational data clustering that partitions objects into clusters based on observed relationships. This paper presents Averaged CVB (ACVB) solutions for IRM, convergence-guaranteed and practically useful fast Collapsed Variational Bayes (CVB) inferences. We first derive ordinary CVB and CVB0 for IRM based on the lower bound maximization. CVB solutions yield deterministic iterative procedures for inferring IRM given the truncated number of clusters. Our proposal includes CVB0 updates of hyperparameters including the concentration parameter of the Dirichlet Process, which has not been studied in the literature. To make the CVB more practically useful, we further study the CVB inference in two aspects. First, we study the convergence issues and develop a convergence-guaranteed algorithm for any CVB-based inferences called ACVB, which enables automatic convergence detection and frees non-expert practitioners from difficult and costly manual monitoring of inference processes. Second, we present a few techniques for speeding up IRM inferences. In particular, we describe the linear time inference of CVB0, allowing the IRM for larger relational data uses. The ACVB solutions of IRM showed comparable or better performance compared to existing inference methods in experiments, and provide deterministic, faster, and easier convergence detection.
Anomaly Detection Based on Aggregation of Indicators
Rabenoro, Tsirizo, Lacaille, Jérôme, Cottrell, Marie, Rossi, Fabrice
Automatic anomaly detection is a major issue in various areas. Beyond mere detection, the identification of the origin of the problem that produced the anomaly is also essential. This paper introduces a general methodology that can assist human operators who aim at classifying monitoring signals. The main idea is to leverage expert knowledge by generating a very large number of indicators. A feature selection method is used to keep only the most discriminant indicators which are used as inputs of a Naive Bayes classifier. The parameters of the classifier have been optimized indirectly by the selection process. Simulated data designed to reproduce some of the anomaly types observed in real world engines.