Bayesian Inference
Supervised classification via minimax probabilistic transformations
Mazuelas, Santiago, Zanoni, Andrea, Perez, Aritz
One of the most common and studied problem in machine learning is classification. While conventional algorithms for supervised classification rely on the determination of a function from features to labels, we propose a different approach based on the estimation of a probabilistic transformation from features to labels. Indeed, we determine a conditional probability distribution of the labels given the features and then features are classified as labels following such distribution. In order to compute the conditional distribution, we follow a robust minimax approach, minimizing the worst-case expectation of the 0-1 loss. By doing so, we find the probabilistic transformation which achieves the minimum risk against an uncertainty set consistent with the training data. We show numerical results obtained by an implementation in python of this method and we compare its performance with state of the art techniques.
Key Terms in the Field of Artificial Intelligence
Binary Tree – a tree data structure where each node has at most two nodes (left and right nodes) and a data element. The topmost node of the tree is the root node. Bayes' Theorem – named after 18th century British mathematician Thomas Bayes, it is a formula for determining conditional probability Eigenvalue – any number such that a given matrix minus that number times the identity matrix has zero determinant. Eigenvector - a vector which when operated on by a given operator gives a scalar multiple of itself. Fourier transform – named after French mathematician Joseph Fourier, it's a method for converting a time function into one expressed in terms of frequency
Understanding MCMC Dynamics as Flows on the Wasserstein Space
Liu, Chang, Zhuo, Jingwei, Zhu, Jun
It is known that the Langevin dynamics used in MCMC is the gradient flow of the KL divergence on the Wasserstein space, which helps convergence analysis and inspires recent particle-based variational inference methods (ParVIs). But no more MCMC dynamics is understood in this way. In this work, by developing novel concepts, we propose a theoretical framework that recognizes a general MCMC dynamics as the fiber-gradient Hamiltonian flow on the Wasserstein space of a fiber-Riemannian Poisson manifold. The "conservation + convergence" structure of the flow gives a clear picture on the behavior of general MCMC dynamics. We analyse existing MCMC instances under the framework. The framework also enables ParVI simulation of MCMC dynamics, which enriches the ParVI family with more efficient dynamics, and also adapts ParVI advantages to MCMCs. We develop two ParVI methods for a particular MCMC dynamics and demonstrate the benefits in experiments.
Meta Particle Flow for Sequential Bayesian Inference
Chen, Xinshi, Dai, Hanjun, Song, Le
We present a particle flow realization of Bayes' rule, where an ODE-based neural operator is used to transport particles from a prior to its posterior after a new observation. We prove that such an ODE operator exists and its neural parameterization can be trained in a meta-learning framework, allowing this operator to reason about the effect of an individual observation on the posterior, and thus generalize across different priors, observations and to online Bayesian inference. We demonstrated the generalization ability of our particle flow Bayes operator in several canonical and high dimensional examples.
Efficient Learning of Discrete Graphical Models
Vuffray, Marc, Misra, Sidhant, Lokhov, Andrey Y.
Graphical models are useful tools for describing structured high-dimensional probability distributions. Development of efficient algorithms for learning graphical models with least amount of data remains an active research topic. Reconstruction of graphical models that describe the statistics of discrete variables is a particularly challenging problem, for which the maximum likelihood approach is intractable. In this work, we provide the first sample-efficient method based on the Interaction Screening framework that allows one to provably learn fully general discrete factor models with node-specific discrete alphabets and multi-body interactions, specified in an arbitrary basis. We identify a single condition related to model parametrization that leads to rigorous guarantees on the recovery of model structure and parameters in any error norm, and is readily verifiable for a large class of models. Importantly, our bounds make explicit distinction between parameters that are proper to the model and priors used as an input to the algorithm. Finally, we show that the Interaction Screening framework includes all models previously considered in the literature as special cases, and for which our analysis shows a systematic improvement in sample complexity.
Challenges with EM in application to weakly identifiable mixture models
Dwivedi, Raaz, Ho, Nhat, Khamaru, Koulik, Wainwright, Martin J., Jordan, Michael I., Yu, Bin
We study a class of weakly identifiable location-scale mixture models for which the maximum likelihood estimates based on $n$ i.i.d. samples are known to have lower accuracy than the classical $n^{- \frac{1}{2}}$ error. We investigate whether the Expectation-Maximization (EM) algorithm also converges slowly for these models. We first demonstrate via simulation studies a broad range of over-specified mixture models for which the EM algorithm converges very slowly, both in one and higher dimensions. We provide a complete analytical characterization of this behavior for fitting data generated from a multivariate standard normal distribution using two-component Gaussian mixture with varying location and scale parameters. Our results reveal distinct regimes in the convergence behavior of EM as a function of the dimension $d$. In the multivariate setting ($d \geq 2$), when the covariance matrix is constrained to a multiple of the identity matrix, the EM algorithm converges in order $(n/d)^{\frac{1}{2}}$ steps and returns estimates that are at a Euclidean distance of order ${(n/d)^{-\frac{1}{4}}}$ and ${ (n d)^{- \frac{1}{2}}}$ from the true location and scale parameter respectively. On the other hand, in the univariate setting ($d = 1$), the EM algorithm converges in order $n^{\frac{3}{4} }$ steps and returns estimates that are at a Euclidean distance of order ${ n^{- \frac{1}{8}}}$ and ${ n^{-\frac{1} {4}}}$ from the true location and scale parameter respectively. Establishing the slow rates in the univariate setting requires a novel localization argument with two stages, with each stage involving an epoch-based argument applied to a different surrogate EM operator at the population level. We also show multivariate ($d \geq 2$) examples, involving more general covariance matrices, that exhibit the same slow rates as the univariate case.
Maximum Likelihood Estimation and Graph Matching in Errorfully Observed Networks
Arroyo, Jesús, Sussman, Daniel L., Priebe, Carey E., Lyzinski, Vince
Given a pair of graphs with the same number of vertices, the inexact graph matching problem consists in finding a correspondence between the vertices of these graphs that minimizes the total number of induced edge disagreements. We study this problem from a statistical framework in which one of the graphs is an errorfully observed copy of the other. We introduce a corrupting channel model, and show that in this model framework, the solution to the graph matching problem is a maximum likelihood estimator. Necessary and sufficient conditions for consistency of this MLE are presented, as well as a relaxed notion of consistency in which a negligible fraction of the vertices need not be matched correctly. The results are used to study matchability in several families of random graphs, including edge independent models, random regular graphs and small-world networks. We also use these results to introduce measures of matching feasibility, and experimentally validate the results on simulated and real-world networks.
Passed & Spurious: analysing descent algorithms and local minima in spiked matrix-tensor model
Mannelli, Stefano Sarao, Krzakala, Florent, Urbani, Pierfrancesco, Zdeborová, Lenka
We study a loss function that is the negative log-likelihood of the model. We analyse the number of local minima at a fixed distance from the signal/spike with the Kac-Rice formula, and locate trivialization of the landscape at large signal-to-noise ratios. We evaluate in a closed form the performance of a gradient flow algorithm using integro-differential PDEs as developed in physics of disordered systems for the Langevin dynamics. We analyze the performance of an approximate message passing algorithm estimating the maximum likelihood configuration via its state evolution. We conclude by comparing the above results: while we observe a drastic slow down of the gradient flow dynamics even in the region where the landscape is trivial, both the analyzed algorithms are shown to perform well even in the part of the region of parameters where spurious local minima are present.
ProBO: a Framework for Using Probabilistic Programming in Bayesian Optimization
Neiswanger, Willie, Kandasamy, Kirthevasan, Poczos, Barnabas, Schneider, Jeff, Xing, Eric
Optimizing an expensive-to-query function is a common task in science and engineering, where it is beneficial to keep the number of queries to a minimum. A popular strategy is Bayesian optimization (BO), which leverages probabilistic models for this task. Most BO today uses Gaussian processes (GPs), or a few other surrogate models. However, there is a broad set of Bayesian modeling techniques that we may want to use to capture complex systems and reduce the number of queries. Probabilistic programs (PPs) are modern tools that allow for flexible model composition, incorporation of prior information, and automatic inference. In this paper, we develop ProBO, a framework for BO using only standard operations common to most PPs. This allows a user to drop in an arbitrary PP implementation and use it directly in BO. To do this, we describe black box versions of popular acquisition functions that can be used in our framework automatically, without model-specific derivation, and show how to optimize these functions. We also introduce a model, which we term the Bayesian Product of Experts, that integrates into ProBO and can be used to combine information from multiple models implemented with different PPs. We show empirical results using multiple PP implementations, and compare against standard BO methods.
Minimizing Negative Transfer of Knowledge in Multivariate Gaussian Processes: A Scalable and Regularized Approach
Kontar, Raed, Raskutti, Garvesh, Zhou, Shiyu
Recently there has been an increasing interest in the multivariate Gaussian process (MGP) which extends the Gaussian process (GP) to deal with multiple outputs. One approach to construct the MGP and account for non-trivial commonalities amongst outputs employs a convolution process (CP). The CP is based on the idea of sharing latent functions across several convolutions. Despite the elegance of the CP construction, it provides new challenges that need yet to be tackled. First, even with a moderate number of outputs, model building is extremely prohibitive due to the huge increase in computational demands and number of parameters to be estimated. Second, the negative transfer of knowledge may occur when some outputs do not share commonalities. In this paper we address these issues. We propose a regularized pairwise modeling approach for the MGP established using CP. The key feature of our approach is to distribute the estimation of the full multivariate model into a group of bivariate GPs which are individually built. Interestingly pairwise modeling turns out to possess unique characteristics, which allows us to tackle the challenge of negative transfer through penalizing the latent function that facilitates information sharing in each bivariate model. Predictions are then made through combining predictions from the bivariate models within a Bayesian framework. The proposed method has excellent scalability when the number of outputs is large and minimizes the negative transfer of knowledge between uncorrelated outputs. Statistical guarantees for the proposed method are studied and its advantageous features are demonstrated through numerical studies.