Statistical Learning
Streaming, Distributed Variational Inference for Bayesian Nonparametrics
Campbell, Trevor, Straub, Julian, Fisher, John W. III, How, Jonathan P.
This paper presents a methodology for creating streaming, distributed inference algorithms for Bayesian nonparametric (BNP) models. In the proposed framework, processing nodes receive a sequence of data minibatches, compute a variational posterior for each, and make asynchronous streaming updates to a central model. In contrast to previous algorithms, the proposed framework is truly streaming, distributed, asynchronous, learning-rate-free, and truncation-free. The key challenge in developing the framework, arising from the fact that BNP models do not impose an inherent ordering on their components, is finding the correspondence between minibatch and central BNP posterior components before performing each update. To address this, the paper develops a combinatorial optimization problem over component correspondences, and provides an efficient solution technique. The paper concludes with an application of the methodology to the DP mixture model, with experimental results demonstrating its practical scalability and performance.
A Study of the Spatio-Temporal Correlations in Mobile Calls Networks
Guigourès, Romain, Boullé, Marc, Rossi, Fabrice
For the last few years, the amount of data has significantly increased in the companies. It is the reason why data analysis methods have to evolve to meet new demands. In this article, we introduce a practical analysis of a large database from a telecommunication operator. The problem is to segment a territory and characterize the retrieved areas owing to their inhabitant behavior in terms of mobile telephony. We have call detail records collected during five months in France. We propose a two stages analysis. The first one aims at grouping source antennas which originating calls are similarly distributed on target antennas and conversely for target antenna w.r.t. source antenna. A geographic projection of the data is used to display the results on a map of France. The second stage discretizes the time into periods between which we note changes in distributions of calls emerging from the clusters of source antennas. This enables an analysis of temporal changes of inhabitants behavior in every area of the country.
Robust Subspace Clustering via Tighter Rank Approximation
Kang, Zhao, Peng, Chong, Cheng, Qiang
Matrix rank minimization problem is in general NP-hard. The nuclear norm is used to substitute the rank function in many recent studies. Nevertheless, the nuclear norm approximation adds all singular values together and the approximation error may depend heavily on the magnitudes of singular values. This might restrict its capability in dealing with many practical problems. In this paper, an arctangent function is used as a tighter approximation to the rank function. We use it on the challenging subspace clustering problem. For this nonconvex minimization problem, we develop an effective optimization procedure based on a type of augmented Lagrange multipliers (ALM) method. Extensive experiments on face clustering and motion segmentation show that the proposed method is effective for rank approximation.
A Unified Framework for Representation-based Subspace Clustering of Out-of-sample and Large-scale Data
Peng, Xi, Tang, Huajin, Zhang, Lei, Yi, Zhang, Xiao, Shijie
Under the framework of spectral clustering, the key of subspace clustering is building a similarity graph which describes the neighborhood relations among data points. Some recent works build the graph using sparse, low-rank, and $\ell_2$-norm-based representation, and have achieved state-of-the-art performance. However, these methods have suffered from the following two limitations. First, the time complexities of these methods are at least proportional to the cube of the data size, which make those methods inefficient for solving large-scale problems. Second, they cannot cope with out-of-sample data that are not used to construct the similarity graph. To cluster each out-of-sample datum, the methods have to recalculate the similarity graph and the cluster membership of the whole data set. In this paper, we propose a unified framework which makes representation-based subspace clustering algorithms feasible to cluster both out-of-sample and large-scale data. Under our framework, the large-scale problem is tackled by converting it as out-of-sample problem in the manner of "sampling, clustering, coding, and classifying". Furthermore, we give an estimation for the error bounds by treating each subspace as a point in a hyperspace. Extensive experimental results on various benchmark data sets show that our methods outperform several recently-proposed scalable methods in clustering large-scale data set.
Nonconvex Penalization in Sparse Estimation: An Approach Based on the Bernstein Function
In this paper we study nonconvex penalization using Bernstein functions whose first-order derivatives are completely monotone. The Bernstein function can induce a class of nonconvex penalty functions for high-dimensional sparse estimation problems. We derive a thresholding function based on the Bernstein penalty and discuss some important mathematical properties in sparsity modeling. We show that a coordinate descent algorithm is especially appropriate for regression problems penalized by the Bernstein function. We also consider the application of the Bernstein penalty in classification problems and devise a proximal alternating linearized minimization method. Based on theory of the Kurdyka-Lojasiewicz inequality, we conduct convergence analysis of these alternating iteration procedures. We particularly exemplify a family of Bernstein nonconvex penalties based on a generalized Gamma measure and conduct empirical analysis for this family.
Covariance-Controlled Adaptive Langevin Thermostat for Large-Scale Bayesian Sampling
Shang, Xiaocheng, Zhu, Zhanxing, Leimkuhler, Benedict, Storkey, Amos J.
Monte Carlo sampling for Bayesian posterior inference is a common approach used in machine learning. The Markov Chain Monte Carlo procedures that are used are often discrete-time analogues of associated stochastic differential equations (SDEs). These SDEs are guaranteed to leave invariant the required posterior distribution. An area of current research addresses the computational benefits of stochastic gradient methods in this setting. Existing techniques rely on estimating the variance or covariance of the subsampling error, and typically assume constant variance. In this article, we propose a covariance-controlled adaptive Langevin thermostat that can effectively dissipate parameter-dependent noise while maintaining a desired target distribution. The proposed method achieves a substantial speedup over popular alternative schemes for large-scale machine learning applications.
Higher-order asymptotics for the parametric complexity
The minimum description length (MDL) principle provides a general information-theoretic approach to model selection and other forms of statistical inference [5, 17]. The MDL criterion for model selection is consistent, meaning that it will select the data-generating model from a countable set of competing parametric models with probability approaching 1 as the sample size n goes to infinity [4]. For example, if each of the parametric models is a logistic regression model with predictor variables taken from a fixed set of potential predictors, then the MDL model-selection criterion will choose the correct combination of predictors with probability approaching 1 as n . The MDL model-selection criterion also has a number of strong optimality properties, which greatly extend Shannon's noiseless coding theorem [5, §III.E]. In its simplest form, the MDL principle advocates choosing the model for which the observed data has the shortest message length under a particular prefix code defined by a minimax condition [11, §2.4.3]. Shtarkov [19] showed that this is equivalent to choosing the model with the largest normalized maximum likelihood (NML) for the observed data.
Principal Differences Analysis: Interpretable Characterization of Differences between Distributions
Mueller, Jonas, Jaakkola, Tommi
We introduce principal differences analysis (PDA) for analyzing differences between high-dimensional distributions. The method operates by finding the projection that maximizes the Wasserstein divergence between the resulting univariate populations. Relying on the Cramer-Wold device, it requires no assumptions about the form of the underlying distributions, nor the nature of their inter-class differences. A sparse variant of the method is introduced to identify features responsible for the differences. We provide algorithms for both the original minimax formulation as well as its semidefinite relaxation. In addition to deriving some convergence results, we illustrate how the approach may be applied to identify differences between cell populations in the somatosensory cortex and hippocampus as manifested by single cell RNA-seq. Our broader framework extends beyond the specific choice of Wasserstein divergence.
Robust Gaussian Graphical Modeling with the Trimmed Graphical Lasso
Yang, Eunho, Lozano, Aurélie C.
Gaussian Graphical Models (GGMs) are popular tools for studying network structures. However, many modern applications such as gene network discovery and social interactions analysis often involve high-dimensional noisy data with outliers or heavier tails than the Gaussian distribution. In this paper, we propose the Trimmed Graphical Lasso for robust estimation of sparse GGMs. Our method guards against outliers by an implicit trimming mechanism akin to the popular Least Trimmed Squares method used for linear regression. We provide a rigorous statistical analysis of our estimator in the high-dimensional setting. In contrast, existing approaches for robust sparse GGMs estimation lack statistical guarantees. Our theoretical results are complemented by experiments on simulated and real gene expression data which further demonstrate the value of our approach.
Fast Landmark Subspace Clustering
Kernel methods obtain superb performance in terms of accuracy for various machine learning tasks since they can effectively extract nonlinear relations. However, their time complexity can be rather large especially for clustering tasks. In this paper we define a general class of kernels that can be easily approximated by randomization. These kernels appear in various applications, in particular, traditional spectral clustering, landmark-based spectral clustering and landmark-based subspace clustering. We show that for $n$ data points from $K$ clusters with $D$ landmarks, the randomization procedure results in an algorithm of complexity $O(KnD)$. Furthermore, we bound the error between the original clustering scheme and its randomization. To illustrate the power of this framework, we propose a new fast landmark subspace (FLS) clustering algorithm. Experiments over synthetic and real datasets demonstrate the superior performance of FLS in accelerating subspace clustering with marginal sacrifice of accuracy.