Bayesian Inference
Structure Learning in Graphical Modeling
Drton, Mathias, Maathuis, Marloes H.
A graphical model is a statistical model that is associated to a graph whose nodes correspond to variables of interest. The edges of the graph reflect allowed conditional dependencies among the variables. Graphical models admit computationally convenient factorization properties and have long been a valuable tool for tractable modeling of multivariate distributions. More recently, applications such as reconstructing gene regulatory networks from gene expression data have driven major advances in structure learning, that is, estimating the graph underlying a model. We review some of these advances and discuss methods such as the graphical lasso and neighborhood selection for undirected graphical models (or Markov random fields), and the PC algorithm and score-based search methods for directed graphical models (or Bayesian networks).
Classification of weak multi-view signals by sharing factors in a mixture of Bayesian group factor analyzers
Remes, Sami, Mononen, Tommi, Kaski, Samuel
We propose a novel classification model for weak signal data, building upon a recent model for Bayesian multi-view learning, Group Factor Analysis (GFA). Instead of assuming all data to come from a single GFA model, we allow latent clusters, each having a different GFA model and producing a different class distribution. We show that sharing information across the clusters, by sharing factors, increases the classification accuracy considerably; the shared factors essentially form a flexible noise model that explains away the part of data not related to classification. Motivation for the setting comes from single-trial functional brain imaging data, having a very low signal-to-noise ratio and a natural multi-view setting, with the different sensors, measurement modalities (EEG, MEG, fMRI) and possible auxiliary information as views. We demonstrate our model on a MEG dataset.
Revealed Preference at Scale: Learning Personalized Preferences from Assortment Choices
Kallus, Nathan, Udell, Madeleine
We consider the problem of learning the preferences of a heterogeneous population by observing choices from an assortment of products, ads, or other offerings. Our observation model takes a form common in assortment planning applications: each arriving customer is offered an assortment consisting of a subset of all possible offerings; we observe only the assortment and the customer's single choice. In this paper we propose a mixture choice model with a natural underlying low-dimensional structure, and show how to estimate its parameters. In our model, the preferences of each customer or segment follow a separate parametric choice model, but the underlying structure of these parameters over all the models has low dimension. We show that a nuclear-norm regularized maximum likelihood estimator can learn the preferences of all customers using a number of observations much smaller than the number of item-customer combinations. This result shows the potential for structural assumptions to speed up learning and improve revenues in assortment planning and customization. We provide a specialized factored gradient descent algorithm and study the success of the approach empirically.
Bayesian Poisson Tucker Decomposition for Learning the Structure of International Relations
Schein, Aaron, Zhou, Mingyuan, Blei, David M., Wallach, Hanna
We introduce Bayesian Poisson Tucker decomposition (BPTD) for modeling country--country interaction event data. These data consist of interaction events of the form "country $i$ took action $a$ toward country $j$ at time $t$." BPTD discovers overlapping country--community memberships, including the number of latent communities. In addition, it discovers directed community--community interaction networks that are specific to "topics" of action types and temporal "regimes." We show that BPTD yields an efficient MCMC inference algorithm and achieves better predictive performance than related models. We also demonstrate that it discovers interpretable latent structure that agrees with our knowledge of international relations.
Relaxation of the EM Algorithm via Quantum Annealing
Miyahara, Hideyuki, Tsumura, Koji
The EM algorithm is a novel numerical method to obtain maximum likelihood estimates and is often used for practical calculations. However, many of maximum likelihood estimation problems are nonconvex, and it is known that the EM algorithm fails to give the optimal estimate by being trapped by local optima. In order to deal with this difficulty, we propose a deterministic quantum annealing EM algorithm by introducing the mathematical mechanism of quantum fluctuations into the conventional EM algorithm because quantum fluctuations induce the tunnel effect and are expected to relax the difficulty of nonconvex optimization problems in the maximum likelihood estimation problems. We show a theorem that guarantees its convergence and give numerical experiments to verify its efficiency.
Understanding beta binomial regression (using baseball statistics)
In this series we've been using the empirical Bayes method to estimate batting averages of baseball players. Empirical Bayes is useful here because when we don't have a lot of information about a batter, they're "shrunken" towards the average across all players, as a natural consequence of the beta prior. When players are better, they are given more chances to bat! (Hat tip to Hadley Wickham to pointing this complication out to me). That means there's a relationship between the number of at-bats (AB) and the true batting average. For reasons I explain below, this makes our estimates systematically inaccurate.
Semidefinite Programs for Exact Recovery of a Hidden Community
Hajek, Bruce, Wu, Yihong, Xu, Jiaming
We study a semidefinite programming (SDP) relaxation of the maximum likelihood estimation for exactly recovering a hidden community of cardinality $K$ from an $n \times n$ symmetric data matrix $A$, where for distinct indices $i,j$, $A_{ij} \sim P$ if $i, j$ are both in the community and $A_{ij} \sim Q$ otherwise, for two known probability distributions $P$ and $Q$. We identify a sufficient condition and a necessary condition for the success of SDP for the general model. For both the Bernoulli case ($P={{\rm Bern}}(p)$ and $Q={{\rm Bern}}(q)$ with $p>q$) and the Gaussian case ($P=\mathcal{N}(\mu,1)$ and $Q=\mathcal{N}(0,1)$ with $\mu>0$), which correspond to the problem of planted dense subgraph recovery and submatrix localization respectively, the general results lead to the following findings: (1) If $K=\omega( n /\log n)$, SDP attains the information-theoretic recovery limits with sharp constants; (2) If $K=\Theta(n/\log n)$, SDP is order-wise optimal, but strictly suboptimal by a constant factor; (3) If $K=o(n/\log n)$ and $K \to \infty$, SDP is order-wise suboptimal. The same critical scaling for $K$ is found to hold, up to constant factors, for the performance of SDP on the stochastic block model of $n$ vertices partitioned into multiple communities of equal size $K$. A key ingredient in the proof of the necessary condition is a construction of a primal feasible solution based on random perturbation of the true cluster matrix.
When we say PhD in NLP or PhD in bayesian networks or PhD in boosting, how all the topics listed below are related? • /r/MachineLearning
There are three different types of topics in machine learning, the first ones are like NLP, Computer vision, Robotics etc. and other ones are algorithms in machine learning like genetic algorithms, neural networks, bayesian networks etc and thirdly there are concepts like decision trees, random forest, PCA etc. So, how are all these topics related when I say PhD in Bayesian Networks or PhD in NLP or PhD in boosting etc?
Bayesian Learning of Kernel Embeddings
Flaxman, Seth, Sejdinovic, Dino, Cunningham, John P., Filippi, Sarah
Kernel methods are one of the mainstays of machine learning, but the problem of kernel learning remains challenging, with only a few heuristics and very little theory. This is of particular importance in methods based on estimation of kernel mean embeddings of probability measures. For characteristic kernels, which include most commonly used ones, the kernel mean embedding uniquely determines its probability measure, so it can be used to design a powerful statistical testing framework, which includes nonparametric two-sample and independence tests. In practice, however, the performance of these tests can be very sensitive to the choice of kernel and its lengthscale parameters. To address this central issue, we propose a new probabilistic model for kernel mean embeddings, the Bayesian Kernel Embedding model, combining a Gaussian process prior over the Reproducing Kernel Hilbert Space containing the mean embedding with a conjugate likelihood function, thus yielding a closed form posterior over the mean embedding. The posterior mean of our model is closely related to recently proposed shrinkage estimators for kernel mean embeddings, while the posterior uncertainty is a new, interesting feature with various possible applications. Critically for the purposes of kernel learning, our model gives a simple, closed form marginal pseudolikelihood of the observed data given the kernel hyperparameters. This marginal pseudolikelihood can either be optimized to inform the hyperparameter choice or fully Bayesian inference can be used.
Black-box $\alpha$-divergence Minimization
Hernández-Lobato, José Miguel, Li, Yingzhen, Rowland, Mark, Hernández-Lobato, Daniel, Bui, Thang, Turner, Richard E.
Black-box alpha (BB-$\alpha$) is a new approximate inference method based on the minimization of $\alpha$-divergences. BB-$\alpha$ scales to large datasets because it can be implemented using stochastic gradient descent. BB-$\alpha$ can be applied to complex probabilistic models with little effort since it only requires as input the likelihood function and its gradients. These gradients can be easily obtained using automatic differentiation. By changing the divergence parameter $\alpha$, the method is able to interpolate between variational Bayes (VB) ($\alpha \rightarrow 0$) and an algorithm similar to expectation propagation (EP) ($\alpha = 1$). Experiments on probit regression and neural network regression and classification problems show that BB-$\alpha$ with non-standard settings of $\alpha$, such as $\alpha = 0.5$, usually produces better predictions than with $\alpha \rightarrow 0$ (VB) or $\alpha = 1$ (EP).