Bayesian Inference
Max-Margin Nonparametric Latent Feature Models for Link Prediction
Zhu, Jun, Song, Jiaming, Chen, Bei
Link prediction is a fundamental task in statistical network analysis. Recent advances have been made on learning flexible nonparametric Bayesian latent feature models for link prediction. In this paper, we present a max-margin learning method for such nonparametric latent feature relational models. Our approach attempts to unite the ideas of max-margin learning and Bayesian nonparametrics to discover discriminative latent features for link prediction. It inherits the advances of nonparametric Bayesian methods to infer the unknown latent social dimension, while for discriminative link prediction, it adopts the max-margin learning principle by minimizing a hinge-loss using the linear expectation operator, without dealing with a highly nonlinear link likelihood function. For posterior inference, we develop an efficient stochastic variational inference algorithm under a truncated mean-field assumption. Our methods can scale up to large-scale real networks with millions of entities and tens of millions of positive links. We also provide a full Bayesian formulation, which can avoid tuning regularization hyper-parameters. Experimental results on a diverse range of real datasets demonstrate the benefits inherited from max-margin learning and Bayesian nonparametric inference.
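The idea of "minimizing a hinge-loss using the linear expectation operator" can be illustrated with a small sketch. It assumes, purely for illustration, a fully factorized posterior, under which the expected score of a pair of distinct entities is the product of the expected feature vectors and the expected weight matrix. The names expected_score, hinge_loss, psi_i, psi_j, and mean_U are hypothetical and the numbers are made up; this is not the authors' implementation.

```python
from typing import List

def expected_score(psi_i: List[float], mean_U: List[List[float]], psi_j: List[float]) -> float:
    """Expected bilinear score E[Z_i] E[U] E[Z_j]^T for two distinct entities.

    Assumption: Z_i, Z_j and U are independent under the (approximate) posterior,
    so the expectation of the product is the product of the expectations.
    """
    return sum(
        psi_i[k] * mean_U[k][l] * psi_j[l]
        for k in range(len(psi_i))
        for l in range(len(psi_j))
    )

def hinge_loss(label: int, score: float, margin: float = 1.0) -> float:
    """Hinge loss for a link label in {+1, -1}; zero once the margin is met."""
    return max(0.0, margin - label * score)

# Toy example with two latent features (all values are illustrative).
psi_i = [1.0, 0.0]            # expected feature indicators of entity i
psi_j = [1.0, 0.5]            # expected feature indicators of entity j
mean_U = [[2.0, -1.0],
          [0.0, 3.0]]         # expected feature-interaction weights

score = expected_score(psi_i, mean_U, psi_j)
loss_if_link = hinge_loss(+1, score)
loss_if_no_link = hinge_loss(-1, score)
print(score, loss_if_link, loss_if_no_link)
```

Because the loss is applied to an expectation that is linear in the posterior, no nonlinear link likelihood (such as a sigmoid) has to be averaged over the latent variables.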
Dynamic Filtering of Time-Varying Sparse Signals via l1 Minimization
Charles, Adam, Balavoine, Aurele, Rozell, Christopher
Despite the importance of sparsity signal models and the increasing prevalence of high-dimensional streaming data, there are relatively few algorithms for dynamic filtering of time-varying sparse signals. Of the existing algorithms, fewer still provide strong performance guarantees. This paper examines two algorithms for dynamic filtering of sparse signals that are based on efficient l1 optimization methods. We first present an analysis for one simple algorithm (BPDN-DF) that works well when the system dynamics are known exactly. We then introduce a novel second algorithm (RWL1-DF) that is more computationally complex than BPDN-DF but performs better in practice, especially in the case where the system dynamics model is inaccurate. Robustness to model inaccuracy is achieved by using a hierarchical probabilistic data model and propagating higher-order statistics from the previous estimate (akin to Kalman filtering) in the sparse inference process. We demonstrate the properties of these algorithms on both simulated data and natural video sequences. Taken together, the algorithms presented in this paper represent the first strong performance analysis of dynamic filtering algorithms for time-varying sparse signals as well as state-of-the-art performance in this emerging application.
Learning to classify with possible sensor failures
Xie, Tianpei, Nasrabadi, Nasser M., Hero, Alfred O.
Large margin classifiers, such as the support vector machine (SVM) [1] and the maximum entropy discrimination (MED) classifier [2], have enjoyed great popularity in the signal processing and machine learning communities due to their broad applicability, robust performance, and the availability of fast software implementations. When the training data is representative of the test data, the performance of MED/SVM has theoretical guarantees that have been validated in practice [1], [3], [4]. Moreover, since the decision boundary of the MED/SVM is solely defined by a few support vectors, the algorithm can tolerate random feature distortions and perturbations. However, in many real applications, anomalous measurements are inherent to the data set due to strong environmental noise or possible sensor failures. Such anomalies arise in industrial process monitoring, video surveillance, tactical multi-modal sensing, and, more generally, any application that involves unattended sensors in difficult environments (Figure 1).
Predictive Entropy Search for Multi-objective Bayesian Optimization
Hernández-Lobato, Daniel, Hernández-Lobato, José Miguel, Shah, Amar, Adams, Ryan P.
We present PESMO, a Bayesian method for identifying the Pareto set of multi-objective optimization problems, when the functions are expensive to evaluate. The central idea of PESMO is to choose evaluation points so as to maximally reduce the entropy of the posterior distribution over the Pareto set. Critically, the PESMO multi-objective acquisition function can be decomposed as a sum of objective-specific acquisition functions, which enables the algorithm to be used in decoupled scenarios in which the objectives can be evaluated separately and perhaps with different costs. This decoupling capability also makes it possible to identify difficult objectives that require more evaluations. PESMO also offers gains in efficiency, as its cost scales linearly with the number of objectives, in comparison to the exponential cost of other methods. We compare PESMO with other related methods for multi-objective Bayesian optimization on synthetic and real-world problems. The results show that PESMO produces better recommendations with a smaller number of evaluations of the objectives, and that a decoupled evaluation can lead to improvements in performance, particularly when the number of objectives is large.
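For readers unfamiliar with the term, the Pareto set is the set of points that no other point improves on in every objective at once. The sketch below only illustrates that definition on a handful of already-evaluated points, assuming all objectives are minimized; it does not implement PESMO or its acquisition function, and the names dominates, pareto_set, and observed are hypothetical.

```python
from typing import List, Sequence

def dominates(a: Sequence[float], b: Sequence[float]) -> bool:
    """True if a is no worse than b in every objective and strictly better in at least one.

    Assumption: every objective is to be minimized.
    """
    no_worse = all(x <= y for x, y in zip(a, b))
    strictly_better = any(x < y for x, y in zip(a, b))
    return no_worse and strictly_better

def pareto_set(points: List[Sequence[float]]) -> List[int]:
    """Indices of the points that are not dominated by any other point."""
    return [
        i
        for i, p in enumerate(points)
        if not any(dominates(q, p) for j, q in enumerate(points) if j != i)
    ]

# Illustrative objective values (objective 1, objective 2) for five evaluated inputs.
observed = [(1.0, 5.0), (2.0, 3.0), (3.0, 4.0), (4.0, 1.0), (2.5, 3.5)]
front = pareto_set(observed)
print(front)
```

In this toy data the points (3.0, 4.0) and (2.5, 3.5) are both dominated by (2.0, 3.0), so they drop out of the set.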
Statistical Mechanics of High-Dimensional Inference
To model modern large-scale datasets, we need efficient algorithms to infer a set of $P$ unknown model parameters from $N$ noisy measurements. What are fundamental limits on the accuracy of parameter inference, given finite signal-to-noise ratios, limited measurements, prior information, and computational tractability requirements? How can we combine prior information with measurements to achieve these limits? Classical statistics gives incisive answers to these questions as the measurement density $\alpha = \frac{N}{P}\rightarrow \infty$. However, these classical results are not relevant to modern high-dimensional inference problems, which instead occur at finite $\alpha$. We formulate and analyze high-dimensional inference as a problem in the statistical physics of quenched disorder. Our analysis uncovers fundamental limits on the accuracy of inference in high dimensions, and reveals that widely cherished inference algorithms like maximum likelihood (ML) and maximum a posteriori (MAP) inference cannot achieve these limits. We further find optimal, computationally tractable algorithms that can achieve these limits. Intriguingly, in high dimensions, these optimal algorithms become computationally simpler than MAP and ML, while still outperforming them. For example, such optimal algorithms can lead to as much as a 20% reduction in the amount of data to achieve the same performance relative to MAP. Moreover, our analysis reveals simple relations between optimal high-dimensional inference and low-dimensional scalar Bayesian inference, insights into the nature of generalization and predictive power in high dimensions, information-theoretic limits on compressed sensing, phase transitions in quadratic inference, and connections to central mathematical objects in convex optimization theory and random matrix theory.
Efficient functional ANOVA through wavelet-domain Markov groves
We introduce a wavelet-domain functional analysis of variance (fANOVA) method based on a Bayesian hierarchical model. The factor effects are modeled through a spike-and-slab mixture at each location-scale combination along with a normal-inverse-Gamma (NIG) conjugate setup for the coefficients and errors. A graphical model called the Markov grove (MG) is designed to jointly model the spike-and-slab statuses at all location-scale combinations, which incorporates the clustering of each factor effect in the wavelet-domain thereby allowing borrowing of strength across location and scale. The posterior of this NIG-MG model is analytically available through a pyramid algorithm of the same computational complexity as Mallat's pyramid algorithm for discrete wavelet transform, i.e., linear in both the number of observations and the number of locations. Posterior probabilities of factor contributions can also be computed through pyramid recursion, and exact samples from the posterior can be drawn without MCMC. We investigate the performance of our method through extensive simulation and show that it outperforms existing wavelet-domain fANOVA methods in a variety of common settings. We apply the method to analyzing the orthosis data.
Stratified Bayesian Optimization
Toscano-Palmerin, Saul, Frazier, Peter I.
We suppose that f has no special structural properties, e.g., concavity or linearity, that we can exploit to solve this problem, making it a "black box." We also suppose that evaluating f is costly or time-consuming, making these evaluations "expensive" and severely limiting the number of evaluations we may perform. This typically occurs because each evaluation requires running a complex PDE-based or discrete-event simulation, or requires training a machine learning algorithm on a large dataset. When f comes from a discrete-event simulation, this problem is also called "simulation optimization." Bayesian optimization is a popular class of techniques for solving this problem, originating with the seminal paper (Kushner, 1964), and enjoying early contributions from (Mockus et al., 1978; Mockus, 1989). This class of techniques was popularized in the 1990s by the introduction in (Jones et al., 1998) of the most well-known Bayesian optimization method, Efficient Global Optimization (EGO), relying on earlier ideas from (Mockus, 1989). Recently the machine learning community has devoted considerable attention to Bayesian optimization for its applications to tuning computationally intensive machine learning models, as in, e.g., (Snoek et al., 2012). Textbooks and surveys on Bayesian optimization include (Forrester et al., 2008; Brochu et al., 2010). Most work on Bayesian optimization assumes we can observe the objective function directly without noise, but a substantial number of papers, e.g.
Learning Laplacian Matrix in Smooth Graph Signal Representations
Dong, Xiaowen, Thanou, Dorina, Frossard, Pascal, Vandergheynst, Pierre
The construction of a meaningful graph plays a crucial role in the success of many graph-based representations and algorithms for handling structured data, especially in the emerging field of graph signal processing. However, a meaningful graph is not always readily available from the data, nor easy to define depending on the application domain. In particular, it is often desirable in graph signal processing applications that a graph is chosen such that the data admit certain regularity or smoothness on the graph. In this paper, we address the problem of learning graph Laplacians, which is equivalent to learning graph topologies, such that the input data form graph signals with smooth variations on the resulting topology. To this end, we adopt a factor analysis model for the graph signals and impose a Gaussian probabilistic prior on the latent variables that control these signals. We show that the Gaussian prior leads to an efficient representation that favors the smoothness property of the graph signals. We then propose an algorithm for learning graphs that enforces such property and is based on minimizing the variations of the signals on the learned graph. Experiments on both synthetic and real world data demonstrate that the proposed graph learning framework can efficiently infer meaningful graph topologies from signal observations under the smoothness prior.
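The notion of a signal being smooth on a graph is commonly measured by the Laplacian quadratic form x^T L x, which equals the sum over edges of w_ij (x_i - x_j)^2. The sketch below computes that quantity for a fixed, hand-written graph, assuming the combinatorial Laplacian L = D - W; it illustrates the smoothness measure only, not the graph learning algorithm, and the names laplacian, smoothness, and W are hypothetical.

```python
from typing import List

Matrix = List[List[float]]

def laplacian(weights: Matrix) -> Matrix:
    """Combinatorial graph Laplacian L = D - W.

    Assumption: weights is symmetric with a zero diagonal.
    """
    n = len(weights)
    lap = [[-weights[i][j] for j in range(n)] for i in range(n)]
    for i in range(n):
        # Diagonal entry is the weighted degree of node i.
        lap[i][i] = sum(weights[i][j] for j in range(n) if j != i)
    return lap

def smoothness(lap: Matrix, signal: List[float]) -> float:
    """Quadratic form x^T L x; small values mean the signal varies little across edges."""
    n = len(signal)
    return sum(signal[i] * lap[i][j] * signal[j] for i in range(n) for j in range(n))

# Path graph on three nodes: 0 - 1 - 2, unit edge weights (illustrative).
W = [[0.0, 1.0, 0.0],
     [1.0, 0.0, 1.0],
     [0.0, 1.0, 0.0]]
L = laplacian(W)

constant = smoothness(L, [1.0, 1.0, 1.0])      # no variation across edges
ramp = smoothness(L, [1.0, 2.0, 3.0])          # gentle variation
alternating = smoothness(L, [1.0, -1.0, 1.0])  # sign flips on every edge
print(constant, ramp, alternating)
```

A graph learning method of the kind described would search over valid Laplacians to make this quantity small for the observed signals, subject to constraints that rule out the trivial all-zero solution.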
Scaling up Dynamic Topic Models
Bhadury, Arnab, Chen, Jianfei, Zhu, Jun, Liu, Shixia
Dynamic topic models (DTMs) are very effective in discovering topics and capturing their evolution trends in time series data. For posterior inference in DTMs, existing methods are all batch algorithms that scan the full dataset before each update of the model and make inexact variational approximations with mean-field assumptions. Despite their usefulness, DTMs have not been applied to large-scale topic dynamics because a more scalable inference algorithm has been lacking. This paper fills that gap and presents a fast and parallelizable inference algorithm using Gibbs Sampling with Stochastic Gradient Langevin Dynamics that does not make any unwarranted assumptions. We also present a Metropolis-Hastings based $O(1)$ sampler for topic assignments for each word token. In a distributed environment, our algorithm requires very little communication between workers during sampling (almost embarrassingly parallel) and scales up to large-scale applications. We learn what is, to our knowledge, the largest dynamic topic model to date, capturing the dynamics of 1,000 topics from 2.6 million documents in less than half an hour. Our empirical results show that our algorithm is not only orders of magnitude faster than the baselines but also achieves lower perplexity.
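As background on Stochastic Gradient Langevin Dynamics, the sketch below shows the generic Langevin update on a one-dimensional toy target: half a step along the gradient of the log posterior plus Gaussian noise whose variance equals the step size. It is a minimal illustration under stated assumptions, not the sampler from the paper: the target is a standard normal, the exact gradient stands in for the minibatch gradient estimate that SGLD would use on real data, the step size is held constant, and the names sgld_step and grad_log_standard_normal are hypothetical.

```python
import math
import random
from typing import Callable, List

def sgld_step(theta: float,
              grad_log_post: Callable[[float], float],
              step_size: float,
              rng: random.Random) -> float:
    """One Langevin update: theta + (step/2) * gradient + N(0, step) noise."""
    noise = rng.gauss(0.0, math.sqrt(step_size))
    return theta + 0.5 * step_size * grad_log_post(theta) + noise

def grad_log_standard_normal(theta: float) -> float:
    """Gradient of log N(theta; 0, 1); in SGLD this would be a minibatch estimate."""
    return -theta

rng = random.Random(0)        # fixed seed so the run is reproducible
theta = 3.0                   # deliberately poor starting point
samples: List[float] = []
for t in range(20000):
    theta = sgld_step(theta, grad_log_standard_normal, 0.05, rng)
    if t >= 1000:             # discard an initial burn-in period
        samples.append(theta)

sample_mean = sum(samples) / len(samples)
sample_var = sum((s - sample_mean) ** 2 for s in samples) / len(samples)
print(sample_mean, sample_var)
```

With a small step size the retained samples should have mean near 0 and variance near 1, matching the toy target; in practice the step size is usually decayed over iterations.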
TribeFlow: Mining & Predicting User Trajectories
Figueiredo, Flavio, Ribeiro, Bruno, Almeida, Jussara, Faloutsos, Christos
Which song will Smith listen to next? Which restaurant will Alice go to tomorrow? Which product will John click next? These applications have in common the prediction of user trajectories that are in a constant state of flux over a hidden network (e.g. website links, geographic location). What users are doing now may be unrelated to what they will be doing in an hour from now. Mindful of these challenges we propose TribeFlow, a method designed to cope with the complex challenges of learning personalized predictive models of non-stationary, transient, and time-heterogeneous user trajectories. TribeFlow is a general method that can perform next product recommendation, next song recommendation, next location prediction, and general arbitrary-length user trajectory prediction without domain-specific knowledge. TribeFlow is more accurate and up to 413x faster than top competitors.