Bayesian Inference
Phase transitions and sample complexity in Bayes-optimal matrix factorization
Kabashima, Yoshiyuki, Krzakala, Florent, Mรฉzard, Marc, Sakata, Ayaka, Zdeborovรก, Lenka
We analyse the matrix factorization problem. Given a noisy measurement of a product of two matrices, the problem is to estimate back the original matrices. It arises in many applications such as dictionary learning, blind matrix calibration, sparse principal component analysis, blind source separation, low rank matrix completion, robust principal component analysis or factor analysis. It is also important in machine learning: unsupervised representation learning can often be studied through matrix factorization. We use the tools of statistical mechanics - the cavity and replica methods - to analyze the achievability and computational tractability of the inference problems in the setting of Bayes-optimal inference, which amounts to assuming that the two matrices have random independent elements generated from some known distribution, and this information is available to the inference algorithm. In this setting, we compute the minimal mean-squared-error achievable in principle in any computational time, and the error that can be achieved by an efficient approximate message passing algorithm. The computation is based on the asymptotic state-evolution analysis of the algorithm. The performance that our analysis predicts, both in terms of the achieved mean-squared-error, and in terms of sample complexity, is extremely promising and motivating for a further development of the algorithm.
New Optimisation Methods for Machine Learning
A thesis submitted for the degree of Doctor of Philosophy of The Australian National University. In this work we introduce several new optimisation methods for problems in machine learning. Our algorithms broadly fall into two categories: optimisation of finite sums and of graph structured objectives. The finite sum problem is simply the minimisation of objective functions that are naturally expressed as a summation over a large number of terms, where each term has a similar or identical weight. Such objectives most often appear in machine learning in the empirical risk minimisation framework in the non-online learning setting. The second category, that of graph structured objectives, consists of objectives that result from applying maximum likelihood to Markov random field models. Unlike the finite sum case, all the non-linearity is contained within a partition function term, which does not readily decompose into a summation. For the finite sum problem, we introduce the Finito and SAGA algorithms, as well as variants of each. For graph-structured problems, we take three complementary approaches. We look at learning the parameters for a fixed structure, learning the structure independently, and learning both simultaneously. Specifically, for the combined approach, we introduce a new method for encouraging graph structures with the "scale-free" property. For the structure learning problem, we establish SHORTCUT, a O(n^{2.5}) expected time approximate structure learning method for Gaussian graphical models. For problems where the structure is known but the parameters unknown, we introduce an approximate maximum likelihood learning algorithm that is capable of learning a useful subclass of Gaussian graphical models.
Learning Network of Multivariate Hawkes Processes: A Time Series Approach
Etesami, Jalal, Kiyavash, Negar, Zhang, Kun, Singhal, Kushagra
Learning the influence structure of multiple time series data is of great interest to many disciplines. This paper studies the problem of recovering the causal structure in network of multivariate linear Hawkes processes. In such processes, the occurrence of an event in one process affects the probability of occurrence of new events in some other processes. Thus, a natural notion of causality exists between such processes captured by the support of the excitation matrix. We show that the resulting causal influence network is equivalent to the Directed Information graph (DIG) of the processes, which encodes the causal factorization of the joint distribution of the processes. Furthermore, we present an algorithm for learning the support of excitation matrix (or equivalently the DIG). The performance of the algorithm is evaluated on synthesized multivariate Hawkes networks as well as a stock market and MemeTracker real-world dataset.
Top-$K$ Ranking from Pairwise Comparisons: When Spectral Ranking is Optimal
Jang, Minje, Kim, Sunghyun, Suh, Changho, Oh, Sewoong
We explore the top-$K$ rank aggregation problem. Suppose a collection of items is compared in pairs repeatedly, and we aim to recover a consistent ordering that focuses on the top-$K$ ranked items based on partially revealed preference information. We investigate the Bradley-Terry-Luce model in which one ranks items according to their perceived utilities modeled as noisy observations of their underlying true utilities. Our main contributions are two-fold. First, in a general comparison model where item pairs to compare are given a priori, we attain an upper and lower bound on the sample size for reliable recovery of the top-$K$ ranked items. Second, more importantly, extending the result to a random comparison model where item pairs to compare are chosen independently with some probability, we show that in slightly restricted regimes, the gap between the derived bounds reduces to a constant factor, hence reveals that a spectral method can achieve the minimax optimality on the (order-wise) sample size required for top-$K$ ranking. That is to say, we demonstrate a spectral method alone to be sufficient to achieve the optimality and advantageous in terms of computational complexity, as it does not require an additional stage of maximum likelihood estimation that a state-of-the-art scheme employs to achieve the optimality. We corroborate our main results by numerical experiments.
Sequential Monte Carlo Methods for System Identification
Schรถn, Thomas B., Lindsten, Fredrik, Dahlin, Johan, Wรฅgberg, Johan, Naesseth, Christian A., Svensson, Andreas, Dai, Liang
One of the key challenges in identifying nonlinear and possibly non-Gaussian state space models (SSMs) is the intractability of estimating the system state. Sequential Monte Carlo (SMC) methods, such as the particle filter (introduced more than two decades ago), provide numerical solutions to the nonlinear state estimation problems arising in SSMs. When combined with additional identification techniques, these algorithms provide solid solutions to the nonlinear system identification problem. We describe two general strategies for creating such combinations and discuss why SMC is a natural tool for implementing these strategies.
Blind Source Separation: Fundamentals and Recent Advances (A Tutorial Overview Presented at SBrT-2001)
A number of people are found in a room and involved in loud conversations in groups, just as it would happen in a cocktail party. There might also be some background noise, which could be music, car noise from outside, etc. Each person in this room is therefore forced to listen to a mixture of speech sounds coming from various directions, along with some noise. These sounds may come directly to one's ear or have first suffered a sequence of reverberations because of their reflections on the room's walls. The problem of focusing one's listening attention on a particular speaker among this cacophony of conversations and noise has been known as the cocktail party problem [6]. It consists of separating a mixture of speech signals of different characteristics with noise added to it. The signals are a-priori unknown (one listens only to a combination of them) as is also the way they have been mixed. The above scenario is a good analog for many other examples of situations that demand for a separation of mixed signals with no presupposed knowledge on the signals and the system mixing them.
Discriminative models for robust image classification
A variety of real-world tasks involve the classification of images into pre-determined categories. Designing image classification algorithms that exhibit robustness to acquisition noise and image distortions, particularly when the available training data are insufficient to learn accurate models, is a significant challenge. This dissertation explores the development of discriminative models for robust image classification that exploit underlying signal structure, via probabilistic graphical models and sparse signal representations. Probabilistic graphical models are widely used in many applications to approximate high-dimensional data in a reduced complexity set-up. Learning graphical structures to approximate probability distributions is an area of active research. Recent work has focused on learning graphs in a discriminative manner with the goal of minimizing classification error. In the first part of the dissertation, we develop a discriminative learning framework that exploits the complementary yet correlated information offered by multiple representations (or projections) of a given signal/image. Specifically, we propose a discriminative tree-based scheme for feature fusion by explicitly learning the conditional correlations among such multiple projections in an iterative manner. Experiments reveal the robustness of the resulting graphical model classifier to training insufficiency.
Small ensembles of kriging models for optimization
Mohammadi, Hossein, Riche, Rodolphe Le, Touboul, Eric
The Efficient Global Optimization (EGO) algorithm uses a conditional Gaus-sian Process (GP) to approximate an objective function known at a finite number of observation points and sequentially adds new points which maximize the Expected Improvement criterion according to the GP. The important factor that controls the efficiency of EGO is the GP covariance function (or kernel) which should be chosen according to the objective function. Traditionally, a pa-rameterized family of covariance functions is considered whose parameters are learned through statistical procedures such as maximum likelihood or cross-validation. However, it may be questioned whether statistical procedures for learning covariance functions are the most efficient for optimization as they target a global agreement between the GP and the observations which is not the ultimate goal of optimization. Furthermore, statistical learning procedures are computationally expensive. The main alternative to the statistical learning of the GP is self-adaptation, where the algorithm tunes the kernel parameters based on their contribution to objective function improvement. After questioning the possibility of self-adaptation for kriging based optimizers, this paper proposes a novel approach for tuning the length-scale of the GP in EGO: At each iteration, a small ensemble of kriging models structured by their length-scales is created. All of the models contribute to an iterate in an EGO-like fashion. Then, the set of models is densified around the model whose length-scale yielded the best iterate and further points are produced. Numerical experiments are provided which motivate the use of many length-scales. The tested implementation does not perform better than the classical EGO algorithm in a sequential context but show the potential of the approach for parallel implementations.
BIRDNEST: Bayesian Inference for Ratings-Fraud Detection
Hooi, Bryan, Shah, Neil, Beutel, Alex, Gunnemann, Stephan, Akoglu, Leman, Kumar, Mohit, Makhija, Disha, Faloutsos, Christos
Review fraud is a pervasive problem in online commerce, in which fraudulent sellers write or purchase fake reviews to manipulate perception of their products and services. Fake reviews are often detected based on several signs, including 1) they occur in short bursts of time; 2) fraudulent user accounts have skewed rating distributions. However, these may both be true in any given dataset. Hence, in this paper, we propose an approach for detecting fraudulent reviews which combines these 2 approaches in a principled manner, allowing successful detection even when one of these signs is not present. To combine these 2 approaches, we formulate our Bayesian Inference for Rating Data (BIRD) model, a flexible Bayesian model of user rating behavior. Based on our model we formulate a likelihood-based suspiciousness metric, Normalized Expected Surprise Total (NEST). We propose a linear-time algorithm for performing Bayesian inference using our model and computing the metric. Experiments on real data show that BIRDNEST successfully spots review fraud in large, real-world graphs: the 50 most suspicious users of the Flipkart platform flagged by our algorithm were investigated and all identified as fraudulent by domain experts at Flipkart.
Automatic Differentiation Variational Inference
Kucukelbir, Alp, Tran, Dustin, Ranganath, Rajesh, Gelman, Andrew, Blei, David M.
Probabilistic modeling is iterative. A scientist posits a simple model, fits it to her data, refines it according to her analysis, and repeats. However, fitting complex models to large data is a bottleneck in this process. Deriving algorithms for new models can be both mathematically and computationally challenging, which makes it difficult to efficiently cycle through the steps. To this end, we develop automatic differentiation variational inference (ADVI). Using our method, the scientist only provides a probabilistic model and a dataset, nothing else. ADVI automatically derives an efficient variational inference algorithm, freeing the scientist to refine and explore many models. ADVI supports a broad class of models-no conjugacy assumptions are required. We study ADVI across ten different models and apply it to a dataset with millions of observations. ADVI is integrated into Stan, a probabilistic programming system; it is available for immediate use.