Statistical Learning
Structural Similarity and Distance in Learning
Wang, Joseph, Saligrama, Venkatesh, Castañón, David A.
The notion of distance, or more generally, similarity between observations, is at the root of most learning algorithms such as manifold learning, unsupervised clustering, semi-supervised learning, and anomaly detection. Most methods, at a basic level, are based on some function of Euclidean distance, such as the radial basis functions prevalent in supervised classification. Graphbased learning methods employ Euclidean distances to describe local neighborhoods for observations. K-nearest neighbors and ǫ-neighborhoods based on Euclidean distances are used in manifold learning (Isomap [1] and LLE [2]), spectral clustering [3], anomaly detection algorithms [4], [5] and in label propagation algorithms for semi-supervised learning [6]. In many cases, notably sparsely sampled sets of data, Euclidean neighborhoods are not sufficient to represent underlying structure. Consider data drawn from two independent structures. Ideally, a graph would capture the structure of the data with minimal connection between observations on separate structures.
A Flexible, Scalable and Efficient Algorithmic Framework for Primal Graphical Lasso
Mazumder, Rahul, Agarwal, Deepak K.
We propose a scalable, efficient and statistically motivated computational framework for Graphical Lasso (Friedman et al., 2007b) - a covariance regularization framework that has received significant attention in the statistics community over the past few years. Existing algorithms have trouble in scaling to dimensions larger than a thousand. Our proposal significantly enhances the state-of-the-art for such moderate sized problems and gracefully scales to larger problems where other algorithms become practically infeasible. This requires a few key new ideas. We operate on the primal problem and use a subtle variation of block-coordinate-methods which drastically reduces the computational complexity by orders of magnitude. We provide rigorous theoretical guarantees on the convergence and complexity of our algorithm and demonstrate the effectiveness of our proposal via experiments. We believe that our framework extends the applicability of Graphical Lasso to large-scale modern applications like bioinformatics, collaborative filtering and social networks, among others.
Multiple Gaussian Process Models
Archambeau, Cedric, Bach, Francis
We consider a Gaussian process formulation of the multiple kernel learning problem. The goal is to select the convex combination of kernel matrices that best explains the data and by doing so improve the generalisation on unseen data. Sparsity in the kernel weights is obtained by adopting a hierarchical Bayesian approach: Gaussian process priors are imposed over the latent functions and generalised inverse Gaussians on their associated weights. This construction is equivalent to imposing a product of heavy-tailed process priors over function space. A variational inference algorithm is derived for regression and binary classification. 1 Introduction Kernel-based methods are well-established tools for supervised learning, allowing to perform various tasks, such as regression or binary classification, with linear and nonlinear predictors.
Infinitely exchangeable random graphs generated from a Poisson point process on monotone sets and applications to cluster analysis for networks
We construct an infinitely exchangeable process on the set $\cate$ of subsets of the power set of the natural numbers $\mathbb{N}$ via a Poisson point process with mean measure $\Lambda$ on the power set of $\mathbb{N}$. Each $E\in\cate$ has a least monotone cover in $\catf$, the collection of monotone subsets of $\cate$, and every monotone subset maps to an undirected graph $G\in\catg$, the space of undirected graphs with vertex set $\mathbb{N}$. We show a natural mapping $\cate\rightarrow\catf\rightarrow\catg$ which induces an infinitely exchangeable measure on the projective system $\catg^{\rest}$ of graphs $\catg$ under permutation and restriction mappings given an infinitely exchangeable family of measures on the projective system $\cate^{\rest}$ of subsets with permutation and restriction maps. We show potential connections of this process to applications in cluster analysis, machine learning, classification and Bayesian inference.
Multi-Instance Multi-Label Learning
Zhou, Zhi-Hua, Zhang, Min-Ling, Huang, Sheng-Jun, Li, Yu-Feng
Nanjing University, Nanjing 210046, China Abstract In this paper, we propose the MIML (Multi-Instance Multi-Label learning) framework where an example is described by multiple instances and associated with multiple class labels. Compared to traditional learning frameworks, the MIML framework is more convenient and natural for representing complicated objects which have multiple semantic meanings. To learn from MIML examples, we propose the MimlBoost and MimlSvm algorithms based on a simple degeneration strategy, and experiments show that solving problems involving complicated objects with multiple semantic meanings in the MIML framework can lead to good performance. Considering that the degeneration process may lose information, we propose the D-MimlSvm algorithm which tackles MIML problems directly in a regularization framework. Moreover, we show that even when we do not have access to the real objects and thus cannot capture more information from real objects by using the MIML representation, MIML is still useful. We propose the InsDif and SubCod algorithms. InsDif works by transforming single-instances into the MIML representation for learning, while SubCod works by transforming single-label examples into the MIML representation for learning. Experiments show that in some tasks they are able to achieve better performance than learning the single-instances or single-label examples directly. Email: zhouzh@lamda.nju.edu.cn 1 Introduction In traditional supervised learning, an object is represented by an instance, i.e., a feature vector, and associated with a class label. Formally, let X denote the instance space (or feature space) andY the set of class labels. In particular, each object in this framework belongs to only one concept and therefore the corresponding instance is associated with a single class label. However, many real-world objects are complicated, which may belong to multiple concepts simultaneously. For example, an image can belong to several classes simultaneously, e.g., grasslands, lions, Africa, etc.; a text document can be classified to several categories if it is viewed from different aspects, e.g., scientific novel, Jules Verne's writing or even books on traveling;aweb page can be recognized as news page, sports page, soccer page, etc. In a specific real task, maybe only one of the multiple concepts is the right semantic meaning. For example, in image retrieval when a user is interested in an image with lions, s/he may be only interested in the concept lions instead of the other concepts grasslands and Africa associated with that image. The difficulty here is caused by those objects that involve multiple concepts. To choose the right semantic meaning for such objects for a specific scenario is the fundamental difficulty of many tasks.
Kernel Topic Models
Hennig, Philipp, Stern, David, Herbrich, Ralf, Graepel, Thore
We study a variation of this concept, in which the documents' mixture weight beliefs are replaced with squashed Gaussian distributions. This allows documents to be associated with elements of a Hilbert space, admitting kernel topic models (KTM), modelling temporal, spatial, hierarchical, social and other structure between documents. The main challenge is efficient approximate inference on the latent Gaussian. We present an approximate algorithm cast around a Laplace approximation in a transformed basis. The KTM can also be interpreted as a type of Gaussian process latent variable model, or as a topic model conditional on document features, uncovering links between earlier work in these areas.
Gaussian Process Regression Networks
Wilson, Andrew Gordon, Knowles, David A., Ghahramani, Zoubin
We introduce a new regression framework, Gaussian process regression networks (GPRN), which combines the structural properties of Bayesian neural networks with the non-parametric flexibility of Gaussian processes. This model accommodates input dependent signal and noise correlations between multiple response variables, input dependent length-scales and amplitudes, and heavy-tailed predictive distributions. We derive both efficient Markov chain Monte Carlo and variational Bayes inference procedures for this model. We apply GPRN as a multiple output regression and multivariate volatility model, demonstrating substantially improved performance over eight popular multiple output (multi-task) Gaussian process models and three multivariate volatility models on benchmark datasets, including a 1000 dimensional gene expression dataset.
A fast and recursive algorithm for clustering large datasets with $k$-medians
Cardot, Hervé, Cénac, Peggy, Monnez, Jean-Marie
Clustering with fast algorithms large samples of high dimensional data is an important challenge in computational statistics. Borrowing ideas from MacQueen (1967) who introduced a sequential version of the $k$-means algorithm, a new class of recursive stochastic gradient algorithms designed for the $k$-medians loss criterion is proposed. By their recursive nature, these algorithms are very fast and are well adapted to deal with large samples of data that are allowed to arrive sequentially. It is proved that the stochastic gradient algorithm converges almost surely to the set of stationary points of the underlying loss criterion. A particular attention is paid to the averaged versions, which are known to have better performances, and a data-driven procedure that allows automatic selection of the value of the descent step is proposed. The performance of the averaged sequential estimator is compared on a simulation study, both in terms of computation speed and accuracy of the estimations, with more classical partitioning techniques such as $k$-means, trimmed $k$-means and PAM (partitioning around medoids). Finally, this new online clustering technique is illustrated on determining television audience profiles with a sample of more than 5000 individual television audiences measured every minute over a period of 24 hours.
Efficient Latent Variable Graphical Model Selection via Split Bregman Method
Ye, Gui-Bo, Wang, Yuanfeng, Chen, Yifei, Xie, Xiaohui
Abstract: We consider the problem of covariance matrix estimation in the presence of latent variables. Under suitable conditions, it is possible to learn the marginal covariance matrix of the observed variables via a tractable convex program, where the concentration matrix of the observed variables is decomposed into a sparse matrix (representing the graphical structure of the observed variables) and a low rank matrix (representing the marginalization effect of latent variables). We present an efficient first-order method based on split Bregman to solve the convex problem. The algorithm is guaranteed to converge under mild conditions. We show that our algorithm is significantly faster than the state-of-the-art algorithm on both artificial and real-world data. Applying the algorithm to a gene expression data involving thousands of genes, we show that most of the correlation between observed variables can be explained by only a few dozen latent factors.
Large-Margin Learning of Submodular Summarization Methods
Sipos, Ruben, Shivaswamy, Pannaga, Joachims, Thorsten
In this paper, we present a supervised learning approach to training submodular scoring functions for extractive multi-document summarization. By taking a structured predicition approach, we provide a large-margin method that directly optimizes a convex relaxation of the desired performance measure. The learning method applies to all submodular summarization methods, and we demonstrate its effectiveness for both pairwise as well as coverage-based scoring functions on multiple datasets. Compared to state-of-the-art functions that were tuned manually, our method significantly improves performance and enables high-fidelity models with numbers of parameters well beyond what could reasonbly be tuned by hand.