Statistical Learning
From Frequency to Meaning: Vector Space Models of Semantics
Computers understand very little of the meaning of human language. This profoundly limits our ability to give instructions to computers, the ability of computers to explain their actions to us, and the ability of computers to analyse and process text. Vector space models (VSMs) of semantics are beginning to address these limits. This paper surveys the use of VSMs for semantic processing of text. We organize the literature on VSMs according to the structure of the matrix in a VSM. There are currently three broad classes of VSMs, based on term-document, word-context, and pair-pattern matrices, yielding three classes of applications. We survey a broad range of applications in these three categories and we take a detailed look at a specific open source project in each category. Our goal in this survey is to show the breadth of applications of VSMs for semantics, to provide a new perspective on VSMs for those who are already familiar with the area, and to provide pointers into the literature for those who are less familiar with the field.
Feature Hashing for Large Scale Multitask Learning
Weinberger, Kilian, Dasgupta, Anirban, Attenberg, Josh, Langford, John, Smola, Alex
Empirical evidence suggests that hashing is an effective strategy for dimensionality reduction and practical nonparametric estimation. In this paper we provide exponential tail bounds for feature hashing and show that the interaction between random subspaces is negligible with high probability. We demonstrate the feasibility of this approach with experimental results for a new use case -- multitask learning with hundreds of thousands of tasks.
Less Regret via Online Conditioning
Streeter, Matthew, McMahan, H. Brendan
In the past few years, online algorithms have emerged as state-of-the-art techniques for solving large-scale machine learning problems [2, 13, 16]. In addition to their simplicity and generality, online algorithms are natural choices for problems where new data is constantly arriving and rapid adaptation is imporant. Compared to the study of convex optimization in the batch (offline) setting, the study of online convex optimization is relatively new. In light of this, it is not surprising that performance-improving techniques that are well known and widely used in the batch setting do not yet have online analogues. In particular, convergence rates in the batch setting can often be dramatically improved through the use of preconditioning. Yet, the online convex optimization literature provides no comparable method for improving regret(the online analogue of convergence rates).
Syntactic Topic Models
Boyd-Graber, Jordan, Blei, David M.
The syntactic topic model (STM) is a Bayesian nonparametric model of language that discovers latent distributions of words (topics) that are both semantically and syntactically coherent. The STM models dependency parsed corpora where sentences are grouped into documents. It assumes that each word is drawn from a latent topic chosen by combining document-level features and the local syntactic context. Each document has a distribution over latent topics, as in topic models, which provides the semantic consistency. Each element in the dependency parse tree also has a distribution over the topics of its children, as in latent-state syntax models, which provides the syntactic consistency. These distributions are convolved so that the topic of each word is likely under both its document and syntactic context. We derive a fast posterior inference algorithm based on variational methods. We report qualitative and quantitative studies on both synthetic data and hand-parsed documents. We show that the STM is a more predictive model of language than current models based only on syntax or only on topics.
Multibiometrics Belief Fusion
Kisku, Dakshina Ranjan, Sing, Jamuna Kanta, Gupta, Phalguni
This paper proposes a multimodal biometric system through Gaussian Mixture Model (GMM) for face and ear biometrics with belief fusion of the estimated scores characterized by Gabor responses and the proposed fusion is accomplished by Dempster-Shafer (DS) decision theory. Face and ear images are convolved with Gabor wavelet filters to extracts spatially enhanced Gabor facial features and Gabor ear features. Further, GMM is applied to the high-dimensional Gabor face and Gabor ear responses separately for quantitive measurements. Expectation Maximization (EM) algorithm is used to estimate density parameters in GMM. This produces two sets of feature vectors which are then fused using Dempster-Shafer theory. Experiments are conducted on multimodal database containing face and ear images of 400 individuals. It is found that use of Gabor wavelet filters along with GMM and DS theory can provide robust and efficient multimodal fusion strategy.
Operator norm convergence of spectral clustering on level sets
Pelletier, Bruno, Pudlo, Pierre
The aim of data clustering, or unsupervised classification, is to partition a data set into several homogeneous groups relatively separated one from each other with respect to a certain distance or notion of similarity. There exists an extensive literature on clustering methods, and we refer the reader to Anderberg [1973], Hartigan [1975], McLachlan and Peel [2000], Chapter 10 in Duda et al. [2000], and Chapter 14 in Hastie et al. [2001] for general materials on the subject. In particular, popular clustering algorithms, such as Gaussian mixture models or k-means, have proved useful in a number of applications, yet they suffer from some internal and computational limitations. Indeed, the parametric assumption at the core of mixture models may be too stringent, while the standard k-means algorithm fails at identifying complex shaped, possibly non-convex, clusters. The class of spectral clustering algorithms is presently emerging as a promising alternative, showing improved performance over classical clustering algorithms on several benchmark problems and applications; see e.g., Ng et al. [2002], von Luxburg [2007]. An overview of spectral clustering algorithms may be found in von Luxburg [2007], and connections with kernel methods are exposed in Fillipone et al. [2008]. The spectral clustering algorithm amounts at embedding the data into a feature space by using the eigenvectors of the similarity matrix in such a way that the clusters may be separated using simple rules, e.g. a separation by hyperplanes. The core component of the spectral clustering algorithm is therefore the similarity matrix, or certain normalizations of it, generally called graph Laplacian matrices; see Chung [1997]. Graph Laplacian matrices may be viewed as discrete versions of bounded operators between functional spaces.
A Generalization of the Chow-Liu Algorithm and its Application to Statistical Learning
We extend the Chow-Liu algorithm for general random variables while the previous versions only considered finite cases. In particular, this paper applies the generalization to Suzuki's learning algorithm that generates from data forests rather than trees based on the minimum description length by balancing the fitness of the data to the forest and the simplicity of the forest. As a result, we successfully obtain an algorithm when both of the Gaussian and finite random variables are present.
Manifold-Based Signal Recovery and Parameter Estimation from Compressive Measurements
A field known as Compressive Sensing (CS) has recently emerged to help address the growing challenges of capturing and processing high-dimensional signals and data sets. CS exploits the surprising fact that the information contained in a sparse signal can be preserved in a small number of compressive (or random) linear measurements of that signal. Strong theoretical guarantees have been established on the accuracy to which sparse or near-sparse signals can be recovered from noisy compressive measurements. In this paper, we address similar questions in the context of a different modeling framework. Instead of sparse models, we focus on the broad class of manifold models, which can arise in both parametric and non-parametric signal families. Building upon recent results concerning the stable embeddings of manifolds within the measurement space, we establish both deterministic and probabilistic instance-optimal bounds in $\ell_2$ for manifold-based signal recovery and parameter estimation from noisy compressive measurements. In line with analogous results for sparsity-based CS, we conclude that much stronger bounds are possible in the probabilistic setting. Our work supports the growing empirical evidence that manifold-based models can be used with high accuracy in compressive signal processing.
K-Dimensional Coding Schemes in Hilbert Spaces
Pontil, Andreas Maurer Massimiliano
This paper presents a general coding method where data in a Hilbert space are represented by finite dimensional coding vectors. The method is based on empirical risk minimization within a certain class of linear operators, which map the set of coding vectors to the Hilbert space. Two results bounding the expected reconstruction error of the method are derived, which highlight the role played by the codebook and the class of linear operators. The results are specialized to some cases of practical importance, including K-means clustering, nonnegative matrix factorization and other sparse coding methods.
Error-Correcting Tournaments
Beygelzimer, Alina, Langford, John, Ravikumar, Pradeep
Text of abstract We present a family of pairwise tournaments reducing k-class classification to binary classification. These reductions are provably robust against a constant fraction of binary errors, simultaneously matching the best possible computation O(log k) and regret O(1). The construction also works for robustly selecting the best of k-choices by tournament. We strengthen previous results by defeating a more powerful adversary than previously addressed while providing a new form of analysis. In this setting, the error correcting tournament has depth O(log k) while using O(klogk) comparators, both optimal up to a small constant. Keywords: reductions, multiclass classification, cost-sensitive learning, tournaments, robust search 1. Introduction We consider the classical problem of multiclass classification, where given an instance x X, the goal is to predict the most likely label y {1,...,k}, according to some unknown probability distribution. A common general approach to multiclass learning is to reduce a multiclass problem to a set of binary classification problems [2, 7, 11, 12, 15].