Bayesian Inference
Sinkhorn EM: An Expectation-Maximization algorithm based on entropic optimal transport
Mena, Gonzalo, Nejatbakhsh, Amin, Varol, Erdem, Niles-Weed, Jonathan
We study Sinkhorn EM (sEM), a variant of the expectation maximization (EM) algorithm for mixtures based on entropic optimal transport. sEM differs from the classic EM algorithm in the way responsibilities are computed during the expectation step: rather than assign data points to clusters independently, sEM uses optimal transport to compute responsibilities by incorporating prior information about mixing weights. Like EM, sEM has a natural interpretation as a coordinate ascent procedure, which iteratively constructs and optimizes a lower bound on the log-likelihood. However, we show theoretically and empirically that sEM has better behavior than EM: it possesses better global convergence guarantees and is less prone to getting stuck in bad local optima. We complement these findings with experiments on simulated data as well as in an inference task involving C. elegans neurons and show that sEM learns cell labels significantly better than other approaches.
Recovering Joint Probability of Discrete Random Variables from Pairwise Marginals
Learning the joint probability of random variables (RVs) lies at the heart of statistical signal processing and machine learning. However, direct nonparametric estimation for high-dimensional joint probability is in general impossible, due to the curse of dimensionality. Recent work has proposed to recover the joint probability mass function (PMF) of an arbitrary number of RVs from three-dimensional marginals, leveraging the algebraic properties of low-rank tensor decomposition and the (unknown) dependence among the RVs. Nonetheless, accurately estimating three-dimensional marginals can still be costly in terms of sample complexity, affecting the performance of this line of work in practice in the sample-starved regime. Using three-dimensional marginals also involves challenging tensor decomposition problems whose tractability is unclear. This work puts forth a new framework for learning the joint PMF using only pairwise marginals, which naturally enjoys a lower sample complexity relative to the third-order ones. A coupled nonnegative matrix factorization (CNMF) framework is developed, and its joint PMF recovery guarantees under various conditions are analyzed. Our method also features a Gram-Schmidt (GS)-like algorithm that exhibits competitive runtime performance. The algorithm is shown to provably recover the joint PMF up to bounded error in finite iterations, under reasonable conditions. It is also shown that a recently proposed economical expectation maximization (EM) algorithm guarantees to improve upon the GS-like algorithm's output, thereby further lifting up the accuracy and efficiency. Real-data experiments are employed to showcase the effectiveness.
Mixed Logit Models and Network Formation
Gupta, Harsh, Porter, Mason A.
The study of network formation is pervasive in economics, sociology, and many other fields. In this paper, we model network formation as a ``choice'' that is made by nodes in a network to connect to other nodes. We study these ``choices'' using discrete-choice models, in which an agent chooses between two or more discrete alternatives. One framework for studying network formation is the multinomial logit (MNL) model. We highlight limitations of the MNL model on networks that are constructed from empirical data. We employ the ``repeated choice'' (RC) model to study network formation \cite{TrainRevelt97mixedlogit}. We argue that the RC model overcomes important limitations of the MNL model and is well-suited to study network formation. We also illustrate how to use the RC model to accurately study network formation using both synthetic and real-world networks. Using synthetic networks, we also compare the performance of the MNL model and the RC model; we find that the RC model estimates the data-generation process of our synthetic networks more accurately than the MNL model. We provide examples of qualitatively interesting questions -- the presence of homophily in a teen friendship network and the fact that new patents are more likely to cite older, more cited, and similar patents -- for which the RC model allows us to achieve insights.
A Tutorial on VAEs: From Bayes' Rule to Lossless Compression
The Variational Auto-Encoder (VAE) belongs to a class of models, which we will refer to as deep maximum likelihood models, that uses a deep neural network to learn a maximum likelihood model for some input data. They are perhaps the most simple and efficient deep maximum likelihood model available, and have thus gained popularity in representation learning and generative image modeling. Unfortunately, in my opinion, in some circles the term "VAE" has become somewhat synonymous with "an auto-encoder with stochastic regularization that generates useful or beautiful samples", which has led to various misconceptions about VAEs. In this tutorial, we will return to the probabilistic and information theoretic roots of VAEs, clarify common misconceptions about VAEs, and look at a toy example on 2D data that will illustrate the capabilities and limitations of VAEs. In Section 2, we will give an overview of what is a maximum likelihood model and what a VAE looks like.
Bayesian Graph Neural Networks with Adaptive Connection Sampling
Hasanzadeh, Arman, Hajiramezanali, Ehsan, Boluki, Shahin, Zhou, Mingyuan, Duffield, Nick, Narayanan, Krishna, Qian, Xiaoning
We propose a unified framework for adaptive connection sampling in graph neural networks (GNNs) that generalizes existing stochastic regularization methods for training GNNs. The proposed framework not only alleviates over-smoothing and over-fitting tendencies of deep GNNs, but also enables learning with uncertainty in graph analytic tasks with GNNs. Instead of using fixed sampling rates or hand-tuning them as model hyperparameters in existing stochastic regularization methods, our adaptive connection sampling can be trained jointly with GNN model parameters in both global and local fashions. GNN training with adaptive connection sampling is shown to be mathematically equivalent to an efficient approximation of training Bayesian GNNs. Experimental results with ablation studies on benchmark datasets validate that adaptively learning the sampling rate given graph training data is the key to boost the performance of GNNs in semi-supervised node classification, less prone to over-smoothing and over-fitting with more robust prediction.
Inference in Bayesian Additive Vector Autoregressive Tree Models
Vector autoregressive (VAR) models assume linearity between the endogenous variables and their lags. This linearity assumption might be overly restrictive and could have a deleterious impact on forecasting accuracy. As a solution, we propose combining VAR with Bayesian additive regression tree (BART) models. The resulting Bayesian additive vector autoregressive tree (BAVART) model is capable of capturing arbitrary non-linear relations between the endogenous variables and the covariates without much input from the researcher. Since controlling for heteroscedasticity is key for producing precise density forecasts, our model allows for stochastic volatility in the errors. Using synthetic and real data, we demonstrate the advantages of our methods. For Eurozone data, we show that our nonparametric approach improves upon commonly used forecasting models and that it produces impulse responses to an uncertainty shock that are consistent with established findings in the literature.
Policy Gradient Optimization of Thompson Sampling Policies
Min, Seungki, Moallemi, Ciamac C., Russo, Daniel J.
We study the use of policy gradient algorithms to optimize over a class of generalized Thompson sampling policies. Our central insight is to view the posterior parameter sampled by Thompson sampling as a kind of pseudo-action. Policy gradient methods can then be tractably applied to search over a class of sampling policies, which determine a probability distribution over pseudo-actions (i.e., sampled parameters) as a function of observed data. We also propose and compare policy gradient estimators that are specialized to Bayesian bandit problems. Numerical experiments demonstrate that direct policy search on top of Thompson sampling automatically corrects for some of the algorithm's known shortcomings and offers meaningful improvements even in long horizon problems where standard Thompson sampling is extremely effective.
Task-Agnostic Online Reinforcement Learning with an Infinite Mixture of Gaussian Processes
Xu, Mengdi, Ding, Wenhao, Zhu, Jiacheng, Liu, Zuxin, Chen, Baiming, Zhao, Ding
Continuously learning to solve unseen tasks with limited experience has been extensively pursued in meta-learning and continual learning, but with restricted assumptions such as accessible task distributions, independently and identically distributed tasks, and clear task delineations. However, real-world physical tasks frequently violate these assumptions, resulting in performance degradation. This paper proposes a continual online model-based reinforcement learning approach that does not require pre-training to solve task-agnostic problems with unknown task boundaries. We maintain a mixture of experts to handle nonstationarity, and represent each different type of dynamics with a Gaussian Process to efficiently leverage collected data and expressively model uncertainty. We propose a transition prior to account for the temporal dependencies in streaming data and update the mixture online via sequential variational inference. Our approach reliably handles the task distribution shift by generating new models for never-before-seen dynamics and reusing old models for previously seen dynamics. In experiments, our approach outperforms alternative methods in non-stationary tasks, including classic control with changing dynamics and decision making in different driving scenarios.
Partial Recovery for Top-$k$ Ranking: Optimality of MLE and Sub-Optimality of Spectral Method
Chen, Pinhan, Gao, Chao, Zhang, Anderson Y.
Given partially observed pairwise comparison data generated by the Bradley-Terry-Luce (BTL) model, we study the problem of top-$k$ ranking. That is, to optimally identify the set of top-$k$ players. We derive the minimax rate with respect to a normalized Hamming loss. This provides the first result in the literature that characterizes the partial recovery error in terms of the proportion of mistakes for top-$k$ ranking. We also derive the optimal signal to noise ratio condition for the exact recovery of the top-$k$ set. The maximum likelihood estimator (MLE) is shown to achieve both optimal partial recovery and optimal exact recovery. On the other hand, we show another popular algorithm, the spectral method, is in general sub-optimal. Our results complement the recent work by Chen et al. (2019) that shows both the MLE and the spectral method achieve the optimal sample complexity for exact recovery. It turns out the leading constants of the sample complexity are different for the two algorithms. Another contribution that may be of independent interest is the analysis of the MLE without any penalty or regularization for the BTL model. This closes an important gap between theory and practice in the literature of ranking.
Multi-fidelity modeling with different input domain definitions using Deep Gaussian Processes
Hebbal, Ali, Brevault, Loic, Balesdent, Mathieu, Talbi, El-Ghazali, Melab, Nouredine
Multi-fidelity approaches combine different models built on a scarce but accurate data-set (high-fidelity data-set), and a large but approximate one (low-fidelity data-set) in order to improve the prediction accuracy. Gaussian Processes (GPs) are one of the popular approaches to exhibit the correlations between these different fidelity levels. Deep Gaussian Processes (DGPs) that are functional compositions of GPs have also been adapted to multi-fidelity using the Multi-Fidelity Deep Gaussian process model (MF-DGP). This model increases the expressive power compared to GPs by considering nonlinear correlations between fidelities within a Bayesian framework. However, these multi-fidelity methods consider only the case where the inputs of the different fidelity models are defined over the same domain of definition (e.g., same variables, same dimensions). However, due to simplification in the modeling of the low-fidelity, some variables may be omitted or a different parametrization may be used compared to the high-fidelity model. In this paper, Deep Gaussian Processes for multi-fidelity (MF-DGP) are extended to the case where a different parametrization is used for each fidelity. The performance of the proposed multi-fidelity modeling technique is assessed on analytical test cases and on structural and aerodynamic real physical problems.