Bayesian Inference
Bayesian multi-tensor factorization
Khan, Suleiman A., Leppรคaho, Eemeli, Kaski, Samuel
We introduce Bayesian multi-tensor factorization, a model that is the first Bayesian formulation for joint factorization of multiple matrices and tensors. The research problem generalizes the joint matrix-tensor factorization problem to arbitrary sets of tensors of any depth, including matrices, can be interpreted as unsupervised multi-view learning from multiple data tensors, and can be generalized to relax the usual trilinear tensor factorization assumptions. The result is a factorization of the set of tensors into factors shared by any subsets of the tensors, and factors private to individual tensors. We demonstrate the performance against existing baselines in multiple tensor factorization tasks in structural toxicogenomics and functional neuroimaging.
Columbia University Free Online Course on Machine Learning
Columbia University is offering free online course on Machine Learning. It is a subfield of computer science that evolved from the study of pattern recognition and computational learning theory in artificial intelligence. In this course applicants will master the essentials of machine learning and algorithms to help improve learning from data without human intervention. The course will start on January 16, 2017. Columbia University is one of the world's most important centers of research and at the same time a distinctive and distinguished learning environment for undergraduates and graduate students in many scholarly and professional fields.
Phase transitions and optimal algorithms in high-dimensional Gaussian mixture clustering
Lesieur, Thibault, De Bacco, Caterina, Banks, Jess, Krzakala, Florent, Moore, Cris, Zdeborovรก, Lenka
Abstract-- We consider the problem of Gaussian mixture clustering in the high-dimensional limit where the data consists of m points in n dimensions, n, m and ฮฑ m/n stays finite. Using exact but non-rigorous methods from statistical physics, we determine the critical value of ฮฑ and the distance between the clusters at which it becomes information-theoretically possible to reconstruct the membership into clusters better than chance. We also determine the accuracy achievable by the Bayes-optimal estimation algorithm. In particular, we find that when the number of clusters is sufficiently large, r 4 2 ฮฑ, there is a gap between the threshold for informationtheoretically optimal performance and the threshold at which known algorithms succeed. Clustering m points in n-dimensional space is a ubiquitous problem in statistical inference and data science.
A Bayesian Information Criterion for Singular Models
On Wednesday, Mathias Drton and I will be presenting a read paper on Bayesian model choice for singular models at the Royal Statistical Society in London. You can read more about it on the RSS web site, where you can also download a preprint. The paper is scheduled to appear, with the discussion, in Series B of the Journal of the Royal Statistical Society next year. The CRAN package sBIC by Luca Weihs implements the ideas in the paper and includes a series of vignettes that allow you to step through some of the examples in the paper.
Pseudo-Bayesian Robust PCA: Algorithms and Analyses
Oh, Tae-Hyun, Matsushita, Yasuyuki, Kweon, In So, Wipf, David
Commonly used in computer vision and other applications, robust PCA represents an algorithmic attempt to reduce the sensitivity of classical PCA to outliers. The basic idea is to learn a decomposition of some data matrix of interest into low rank and sparse components, the latter representing unwanted outliers. Although the resulting optimization problem is typically NP-hard, convex relaxations provide a computationally-expedient alternative with theoretical support. However, in practical regimes performance guarantees break down and a variety of non-convex alternatives, including Bayesian-inspired models, have been proposed to boost estimation quality. Unfortunately though, without additional a priori knowledge none of these methods can significantly expand the critical operational range such that exact principal subspace recovery is possible. Into this mix we propose a novel pseudo-Bayesian algorithm that explicitly compensates for design weaknesses in many existing non-convex approaches leading to state-of-the-art performance with a sound analytical foundation. Surprisingly, our algorithm can even outperform convex matrix completion despite the fact that the latter is provided with perfect knowledge of which entries are not corrupted.
Gamma Belief Networks
Zhou, Mingyuan, Cong, Yulai, Chen, Bo
To infer multilayer deep representations of high-dimensional discrete and nonnegative real vectors, we propose an augmentable gamma belief network (GBN) that factorizes each of its hidden layers into the product of a sparse connection weight matrix and the nonnegative real hidden units of the next layer. The GBN's hidden layers are jointly trained with an upward-downward Gibbs sampler that solves each layer with the same subroutine. The gamma-negative binomial process combined with a layer-wise training strategy allows inferring the width of each layer given a fixed budget on the width of the first layer. Example results illustrate interesting relationships between the width of the first layer and the inferred network structure, and demonstrate that the GBN can add more layers to improve its performance in both unsupervisedly extracting features and predicting heldout data. For exploratory data analysis, we extract trees and subnetworks from the learned deep network to visualize how the very specific factors discovered at the first hidden layer and the increasingly more general factors discovered at deeper hidden layers are related to each other, and we generate synthetic data by propagating random variables through the deep network from the top hidden layer back to the bottom data layer.
Dropout as a Bayesian Approximation: Representing Model Uncertainty in Deep Learning
Gal, Yarin, Ghahramani, Zoubin
Deep learning tools have gained tremendous attention in applied machine learning. However such tools for regression and classification do not capture model uncertainty. In comparison, Bayesian models offer a mathematically grounded framework to reason about model uncertainty, but usually come with a prohibitive computational cost. In this paper we develop a new theoretical framework casting dropout training in deep neural networks (NNs) as approximate Bayesian inference in deep Gaussian processes. A direct result of this theory gives us tools to model uncertainty with dropout NNs -- extracting information from existing models that has been thrown away so far. This mitigates the problem of representing uncertainty in deep learning without sacrificing either computational complexity or test accuracy. We perform an extensive study of the properties of dropout's uncertainty. Various network architectures and non-linearities are assessed on tasks of regression and classification, using MNIST as an example. We show a considerable improvement in predictive log-likelihood and RMSE compared to existing state-of-the-art methods, and finish by using dropout's uncertainty in deep reinforcement learning.
Data Integration with High Dimensionality
We consider a problem of data integration. Consider determining which genes affect a disease. The genes, which we call predictor objects, can be measured in different experiments on the same individual. We address the question of finding which genes are predictors of disease by any of the experiments. Our formulation is more general. In a given data set, there are a fixed number of responses for each individual, which may include a mix of discrete, binary and continuous variables. There is also a class of predictor objects, which may differ within a subject depending on how the predictor object is measured, i.e., depend on the experiment. The goal is to select which predictor objects affect any of the responses, where the number of such informative predictor objects or features tends to infinity as sample size increases. There are marginal likelihoods for each way the predictor object is measured, i.e., for each experiment. We specify a pseudolikelihood combining the marginal likelihoods, and propose a pseudolikelihood information criterion. Under regularity conditions, we establish selection consistency for the pseudolikelihood information criterion with unbounded true model size, which includes a Bayesian information criterion with appropriate penalty term as a special case. Simulations indicate that data integration improves upon, sometimes dramatically, using only one of the data sources.
Funneled Bayesian Optimization for Design, Tuning and Control of Autonomous Systems
Bayesian optimization has become a fundamental global optimization algorithm in many problems where sample efficiency is of paramount importance. Recently, there has been proposed a large number of new applications in fields such as robotics, machine learning, experimental design, simulation, etc. In this paper, we focus on several problems that appear in robotics and autonomous systems: algorithm tuning, automatic control and intelligent design. All those problems can be mapped to global optimization problems. However, they become hard optimization problems. Bayesian optimization internally uses a probabilistic surrogate model (e.g.: Gaussian process) to learn from the process and reduce the number of samples required. In order to generalize to unknown functions in a black-box fashion, the common assumption is that the underlying function can be modeled with a stationary process. Nonstationary Gaussian process regression cannot generalize easily and it typically requires prior knowledge of the function. Some works have designed techniques to generalize Bayesian optimization to nonstationary functions in an indirect way, but using techniques originally designed for regression, where the objective is to improve the quality of the surrogate model everywhere. Instead optimization should focus on improving the surrogate model near the optimum. In this paper, we present a novel kernel function specially designed for Bayesian optimization, that allows nonstationary behavior of the surrogate model in an adaptive local region. In our experiments, we found that this new kernel results in an improved local search (exploitation), without penalizing the global search (exploration). We provide results in well-known benchmarks and real applications. The new method outperforms the state of the art in Bayesian optimization both in stationary and nonstationary problems.
On Identification of Sparse Multivariable ARX Model: A Sparse Bayesian Learning Approach
Jin, J., Yuan, Y., Pan, W., Pham, D. L. T., Tomlin, C. J., Webb, A., Goncalves, J.
This paper begins with considering the identification of sparse linear time-invariant networks described by multivariable ARX models. Such models possess relatively simple structure thus used as a benchmark to promote further research. With identifiability of the network guaranteed, this paper presents an identification method that infers both the Boolean structure of the network and the internal dynamics between nodes. Identification is performed directly from data without any prior knowledge of the system, including its order. The proposed method solves the identification problem using Maximum a posteriori estimation (MAP) but with inseparable penalties for complexity, both in terms of element (order of nonzero connections) and group sparsity (network topology). Such an approach is widely applied in Compressive Sensing (CS) and known as Sparse Bayesian Learning (SBL). We then propose a novel scheme that combines sparse Bayesian and group sparse Bayesian to efficiently solve the problem. The resulted algorithm has a similar form of the standard Sparse Group Lasso (SGL) while with known noise variance, it simplifies to exact re-weighted SGL. The method and the developed toolbox can be applied to infer networks from a wide range of fields, including systems biology applications such as signaling and genetic regulatory networks.