Goto

Collaborating Authors

 Statistical Learning


Tensor machines for learning target-specific polynomial features

arXiv.org Machine Learning

Recent years have demonstrated that using random feature maps can significantly decrease the training and testing times of kernel-based algorithms without significantly lowering their accuracy. Regrettably, because random features are target-agnostic, typically thousands of such features are necessary to achieve acceptable accuracies. In this work, we consider the problem of learning a small number of explicit polynomial features. Our approach, named Tensor Machines, finds a parsimonious set of features by optimizing over the hypothesis class introduced by Kar and Karnick for random feature maps in a target-specific manner. Exploiting a natural connection between polynomials and tensors, we provide bounds on the generalization error of Tensor Machines. Empirically, Tensor Machines behave favorably on several real-world datasets compared to other state-of-the-art techniques for learning polynomial features, and deliver significantly more parsimonious models.


The geometry of kernelized spectral clustering

arXiv.org Machine Learning

Clustering of data sets is a standard problem in many areas of science and engineering. The method of spectral clustering is based on embedding the data set using a kernel function, and using the top eigenvectors of the normalized Laplacian to recover the connected components. We study the performance of spectral clustering in recovering the latent labels of i.i.d. samples from a finite mixture of nonparametric distributions. The difficulty of this label recovery problem depends on the overlap between mixture components and how easily a mixture component is divided into two nonoverlapping components. When the overlap is small compared to the indivisibility of the mixture components, the principal eigenspace of the population-level normalized Laplacian operator is approximately spanned by the square-root kernelized component densities. In the finite sample setting, and under the same assumption, embedded samples from different components are approximately orthogonal with high probability when the sample size is large. As a corollary we control the fraction of samples mislabeled by spectral clustering under finite mixtures with nonparametric components.


Efficient SDP Inference for Fully-connected CRFs Based on Low-rank Decomposition

arXiv.org Machine Learning

Conditional Random Fields (CRF) have been widely used in a variety of computer vision tasks. Conventional CRFs typically define edges on neighboring image pixels, resulting in a sparse graph such that efficient inference can be performed. However, these CRFs fail to model long-range contextual relationships. Fully-connected CRFs have thus been proposed. While there are efficient approximate inference methods for such CRFs, usually they are sensitive to initialization and make strong assumptions. In this work, we develop an efficient, yet general algorithm for inference on fully-connected CRFs. The algorithm is based on a scalable SDP algorithm and the low- rank approximation of the similarity/kernel matrix. The core of the proposed algorithm is a tailored quasi-Newton method that takes advantage of the low-rank matrix approximation when solving the specialized SDP dual problem. Experiments demonstrate that our method can be applied on fully-connected CRFs that cannot be solved previously, such as pixel-level image co-segmentation.


Hyperparameter Search in Machine Learning

arXiv.org Machine Learning

Machine learning research focuses on the development of methods that are capable of capturing some element of interest from a given data set. Such elements include but are not limited to coherent structures within data (clustering) or the ability to predict certain target values based on given characteristics, which may be discrete (classification) or continuous (regression). A large variety of learning methods exist, ranging from biologically inspired neural networks [7] over kernel methods [29] to ensemble models [9, 11]. A common trait in these methods is that they are parameterized by a set of hyperparameters ฮป, which must be set appropriately by the user to maximize the usefulness of the learning approach. Hyperparameters are used to configure various aspects of the learning algorithm and can have wildly varying effects on the resulting model and its performance. Hyperparameter search is commonly performed manually, via rules-of-thumb [19, 20] or by testing sets of hyperparameters on a predefined grid [28]. These approaches leave much to be desired in terms of reproducibility and are impractical when the number of hyperparameters is large [10]. Due to these flaws, the idea of automating hyperparameter search is receiving increasing amounts of attention in machine learning, for instance via benchmarking suites [15] and various initiatives.


Early Stopping is Nonparametric Variational Inference

arXiv.org Machine Learning

We show that unconverged stochastic gradient descent can be interpreted as a procedure that samples from a nonparametric variational approximate posterior distribution. This distribution is implicitly defined as the transformation of an initial distribution by a sequence of optimization updates. By tracking the change in entropy over this sequence of transformations during optimization, we form a scalable, unbiased estimate of the variational lower bound on the log marginal likelihood. We can use this bound to optimize hyperparameters instead of using cross-validation. This Bayesian interpretation of SGD suggests improved, overfitting-resistant optimization procedures, and gives a theoretical foundation for popular tricks such as early stopping and ensembling. We investigate the properties of this marginal likelihood estimator on neural network models.


Explorations on high dimensional landscapes

arXiv.org Machine Learning

Finding minima of a real valued non-convex function over a high dimensional space is a major challenge in science. We provide evidence that some such functions that are defined on high dimensional domains have a narrow band of values whose pre-image contains the bulk of its critical points. This is in contrast with the low dimensional picture in which this band is wide. Our simulations agree with the previous theoretical work on spin glasses that proves the existence of such a band when the dimension of the domain tends to infinity. Furthermore our experiments on teacher-student networks with the MNIST dataset establish a similar phenomenon in deep networks. We finally observe that both the gradient descent and the stochastic gradient descent methods can reach this level within the same number of steps.


Efficient Dictionary Learning via Very Sparse Random Projections

arXiv.org Machine Learning

Performing signal processing tasks on compressive measurements of data has received great attention in recent years. In this paper, we extend previous work on compressive dictionary learning by showing that more general random projections may be used, including sparse ones. More precisely, we examine compressive K-means clustering as a special case of compressive dictionary learning and give theoretical guarantees for its performance for a very general class of random projections. We then propose a memory and computation efficient dictionary learning algorithm, specifically designed for analyzing large volumes of high-dimensional data, which learns the dictionary from very sparse random projections. Experimental results demonstrate that our approach allows for reduction of computational complexity and memory/data access, with controllable loss in accuracy.


DUKE: A Solution for Discovering Neighborhood Patterns in Ego Networks

AAAI Conferences

Given the rapid growth of social media websites and the ease of aggregating ever-richer social data, an inevitable research question that can be expected to emerge is whether different interaction patterns of individuals and their meaningful interpretation can be captured for social network analysis. In this work, we present a novel solution that discovers occurrences of prototypical 'ego network' patterns from social media and mobile-phone networks, to provide a data-driven instrument to be used in behavioral sciences for graph interpretations. We analyze nine datasets gathered from social media websites and mobile phones, together with 13 network measures, and three unsupervised clustering algorithms. Further, we use an unsupervised feature similarity technique to reduce redundancy and extract compact features from the data. The reduced feature subsets are then used to discover ego patterns using various clustering techniques. By cluster analysis, we discover that eight distinct ego neighborhood patterns or ego graphs have emerged. This categorization allows concise analysis of users' data as they change over time. We provide fine-grained analysis for the validity and quality of clustering results. We perform clustering verification based on the following three intuitions: i) analyzing the clustering patterns for the same set of users crawled from three social media networks, ii) associating metadata information with the clusters and evaluating their performance on real networks, iii) studying selected participants over an extended period to analyze their behavior.


Analysis and Prediction of Question Topic Popularity in Community Q&A Sites: A Case Study of Quora

AAAI Conferences

In the past few years, Quora a community-driven social platform for question and answering, has grown exponentially from a small community of users into one of the largest and reliable source of Q&A on the Internet. Quora has a built-in social structure integrated to its backbone; users can follow each other, follow question, topics etc. Apart from the social connections that Quora provides, it has developed a knowledge base nicely organized via hierarchy and relatedness of topics. In this paper, we consider a massive dataset of more than four years and analyze the dynamics of topical growth over time; how various factors affect the popularity of a topic or its acceptance in Q&A community. We also propose a regression model to predict the popularity of the topics and discuss the important discriminating features. We achieve a high prediction accuracy (correlation coefficient ~0.773) with low root mean square error (~1.065). We further categorize thetopics into a few broad classes by implementing a simple Latent Dirichlet Allocation (LDA) model on the question texts associated with the topics. In comparison to the data sample with no categorization, this stratification of the topics enhances the prediction accuracies for several categories. However, for certain categories there seems to a slight decrease in the accuracy values and we present an in-depth discussion analyzing the cause for the same pointing out potential ways for improvement. We believe that this thorough measurement study will have a direct application to a service like recommending trending topics in Quora.


Sync-Rank: Robust Ranking, Constrained Ranking and Rank Aggregation via Eigenvector and Semidefinite Programming Synchronization

arXiv.org Machine Learning

We consider the classic problem of establishing a statistical ranking of a set of n items given a set of inconsistent and incomplete pairwise comparisons between such items. Instantiations of this problem occur in numerous applications in data analysis (e.g., ranking teams in sports data), computer vision, and machine learning. We formulate the above problem of ranking with incomplete noisy information as an instance of the group synchronization problem over the group SO(2) of planar rotations, whose usefulness has been demonstrated in numerous applications in recent years. Its least squares solution can be approximated by either a spectral or a semidefinite programming (SDP) relaxation, followed by a rounding procedure. We perform extensive numerical simulations on both synthetic and real-world data sets, showing that our proposed method compares favorably to other algorithms from the recent literature. Existing theoretical guarantees on the group synchronization problem imply lower bounds on the largest amount of noise permissible in the ranking data while still achieving exact recovery. We propose a similar synchronization-based algorithm for the rank-aggregation problem, which integrates in a globally consistent ranking pairwise comparisons given by different rating systems on the same set of items. We also discuss the problem of semi-supervised ranking when there is available information on the ground truth rank of a subset of players, and propose an algorithm based on SDP which recovers the ranks of the remaining players. Finally, synchronization-based ranking, combined with a spectral technique for the densest subgraph problem, allows one to extract locally-consistent partial rankings, in other words, to identify the rank of a small subset of players whose pairwise comparisons are less noisy than the rest of the data, which other methods are not able to identify.