Algorithmic advances in map synchronization are important for solving a wide range of practical problems, possibly involving large-scale datasets. In this paper, we provide theoretical justifications for spectral techniques for the map synchronization problem, which takes as input a collection of objects and noisy maps estimated between pairs of objects, and outputs clean maps between all pairs of objects. We show that a simple normalized spectral method that projects the blocks of the top eigenvectors of a data matrix onto the map space leads to surprisingly good results. With the noise modelled naturally as random permutation matrices, this algorithm, NormSpecSync, achieves theoretical guarantees competitive with those of state-of-the-art convex optimization techniques, yet it is much more efficient. We demonstrate the usefulness of our algorithm in a couple of applications, where it is optimal in both complexity and exactness among existing methods.
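The spectral rounding idea described above can be illustrated with a minimal sketch. This is not the NormSpecSync algorithm itself. It assumes every pair of objects is observed, so the degree normalization reduces to a constant and is omitted, and it swaps in a greedy rounding step where an exact projection onto permutations would use the Hungarian algorithm. The helper names `make_noisy_maps`, `greedy_round`, and `spectral_sync` are hypothetical.

```python
import numpy as np

def random_perm(m, rng):
    """Random m x m permutation matrix."""
    return np.eye(m)[rng.permutation(m)]

def make_noisy_maps(n, m, p, rng):
    """Hypothetical test data: block X_ij = P_i P_j^T, replaced by a random
    permutation with probability p. All pairs are assumed observed."""
    perms = [random_perm(m, rng) for _ in range(n)]
    X = np.zeros((n * m, n * m))
    for i in range(n):
        X[i*m:(i+1)*m, i*m:(i+1)*m] = np.eye(m)
        for j in range(i + 1, n):
            Xij = perms[i] @ perms[j].T
            if rng.random() < p:
                Xij = random_perm(m, rng)
            X[i*m:(i+1)*m, j*m:(j+1)*m] = Xij
            X[j*m:(j+1)*m, i*m:(i+1)*m] = Xij.T  # keep X symmetric
    return perms, X

def greedy_round(S):
    """Round a score matrix to a permutation matrix by repeatedly taking
    the largest remaining entry (a stand-in for the Hungarian algorithm)."""
    S = S.astype(float).copy()
    m = S.shape[0]
    P = np.zeros((m, m))
    for _ in range(m):
        r, c = np.unravel_index(np.argmax(S), S.shape)
        P[r, c] = 1.0
        S[r, :] = -np.inf  # row r is now used
        S[:, c] = -np.inf  # column c is now used
    return P

def spectral_sync(X, n, m):
    """Estimate all pairwise maps from the top-m eigenvectors of X."""
    _, V = np.linalg.eigh(X)           # eigenvalues in ascending order
    blocks = V[:, -m:].reshape(n, m, m)  # blocks[i] is the i-th m x m block
    return [[greedy_round(blocks[i] @ blocks[j].T) for j in range(n)]
            for i in range(n)]
```

In the noise-free case the data matrix has rank m, so each product of eigenvector blocks is a scaled copy of the true map and the rounding recovers it exactly. With corrupted maps the output is still a permutation for every pair, but recovery is only approximate.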
Crowdsourcing utilizes human ability by distributing tasks to a large number of workers. It is especially suitable for solving data clustering problems because it provides a way to obtain a similarity measure between objects based on manual annotations, which capture the human perception of similarity among objects. This is in contrast to most clustering algorithms, which face the challenge of finding an appropriate similarity measure for the given dataset. Several algorithms have been developed for crowdclustering that combine partial clustering results, each obtained from annotations provided by a different worker, into a single data partition. However, existing crowdclustering approaches require a large number of annotations, due to the noisy nature of human annotations, leading to a high computational cost in addition to the large cost associated with annotation. We address this problem by developing a novel approach for crowdclustering that exploits the technique of matrix completion. Instead of using all the annotations, the proposed algorithm constructs a partially observed similarity matrix based on a subset of pairwise annotation labels that are agreed upon by most annotators. It then deploys a matrix completion algorithm to complete the similarity matrix and obtains the final data partition by applying a spectral clustering algorithm to the completed similarity matrix. We show, both theoretically and empirically, that the proposed approach needs only a small number of manual annotations to obtain an accurate data partition. In effect, we highlight the trade-off between a large number of noisy crowdsourced labels and a small number of high-quality labels.
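The three stages named above (filter by annotator agreement, complete the matrix, cluster spectrally) can be sketched as follows. This is an illustration under stated assumptions, not the proposed algorithm: the vote-count inputs `votes_same` and `votes_total` and the `agree` threshold are hypothetical, a simple iterative rank-r SVD imputation stands in for the matrix completion solver, and a small k-means with farthest-point initialization stands in for the final clustering step.

```python
import numpy as np

def filter_pairs(votes_same, votes_total, agree=0.8):
    """Keep a pair only when at least a fraction `agree` of its annotators
    give the same label. Returns (S, mask): S holds 1 for 'same cluster',
    mask marks observed entries. Pairs with no votes stay unobserved."""
    votes_same = np.asarray(votes_same, dtype=float)
    votes_total = np.asarray(votes_total, dtype=float)
    frac = np.full(votes_same.shape, 0.5)
    np.divide(votes_same, votes_total, out=frac, where=votes_total > 0)
    same = frac >= agree
    diff = (1.0 - frac) >= agree
    S = same.astype(float)
    mask = same | diff
    np.fill_diagonal(S, 1.0)      # every object is similar to itself
    np.fill_diagonal(mask, True)
    return S, mask

def complete_low_rank(S, mask, rank, n_iter=100):
    """Assumed stand-in for matrix completion: alternate between a rank-r
    truncated SVD and re-imposing the observed entries."""
    Z = np.where(mask, S, 0.0)
    for _ in range(n_iter):
        U, s, Vt = np.linalg.svd(Z, full_matrices=False)
        L = (U[:, :rank] * s[:rank]) @ Vt[:rank]
        Z = np.where(mask, S, L)
    return (L + L.T) / 2

def kmeans(X, k, n_iter=50):
    """Lloyd's algorithm with deterministic farthest-point initialization."""
    centers = [X[0]]
    for _ in range(k - 1):
        d = np.min([np.sum((X - c) ** 2, axis=1) for c in centers], axis=0)
        centers.append(X[np.argmax(d)])
    C = np.array(centers)
    for _ in range(n_iter):
        labels = np.argmin(((X[:, None, :] - C[None]) ** 2).sum(-1), axis=1)
        C = np.array([X[labels == j].mean(0) if np.any(labels == j) else C[j]
                      for j in range(k)])
    return labels

def spectral_partition(A, k):
    """Cluster the rows of the top-k eigenvectors of a symmetric matrix."""
    _, V = np.linalg.eigh(A)
    return kmeans(V[:, -k:], k)
```

A typical call chain would be `S, mask = filter_pairs(votes_same, votes_total)` followed by `spectral_partition(complete_low_rank(S, mask, rank=k), k)`, with the completion rank set to the assumed number of clusters.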
Similarity plays a fundamental role in many areas, including data mining, machine learning, statistics and various applied domains. Inspired by the success of ensemble methods and the flexibility of trees, we propose to learn a similarity kernel called rpf-kernel through random projection forests (rpForests). Our theoretical analysis reveals a highly desirable property of rpf-kernel: far-away (dissimilar) points have a low similarity value while nearby (similar) points have a high similarity, and the similarities have a native interpretation as the probability of points remaining in the same leaf nodes during the growth of rpForests. The learned rpf-kernel leads to an effective clustering algorithm, rpfCluster. On a wide variety of real and benchmark datasets, rpfCluster compares favorably to K-means clustering, spectral clustering and a state-of-the-art clustering ensemble algorithm, Cluster Forests. Our approach is simple to implement and readily adapts to the geometry of the underlying data. Given its desirable theoretical property and competitive empirical performance when applied to clustering, we expect rpf-kernel to be applicable to many problems of an unsupervised nature or as a regularizer in some supervised or weakly supervised settings.
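The leaf co-occurrence interpretation above suggests a direct way to estimate such a kernel. The following is a minimal sketch under assumptions, not the rpf-kernel construction itself: each tree splits a node along a Gaussian random direction at a cut drawn uniformly over the projected range, and the similarity of two points is the fraction of trees in which they share a leaf. The names `rpf_kernel`, `_grow`, `n_trees`, and `leaf_size` are hypothetical.

```python
import numpy as np

def _grow(X, idx, leaf_size, rng, leaves):
    """Recursively split the index set idx; append each leaf to `leaves`."""
    if len(idx) <= leaf_size:
        leaves.append(idx)
        return
    proj = X[idx] @ rng.standard_normal(X.shape[1])  # random direction
    cut = rng.uniform(proj.min(), proj.max())        # random split point
    left = proj < cut
    if left.all() or not left.any():  # degenerate split: stop at this node
        leaves.append(idx)
        return
    _grow(X, idx[left], leaf_size, rng, leaves)
    _grow(X, idx[~left], leaf_size, rng, leaves)

def rpf_kernel(X, n_trees=100, leaf_size=5, seed=0):
    """K[i, j] = fraction of trees in which points i and j share a leaf."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    K = np.zeros((n, n))
    for _ in range(n_trees):
        leaves = []
        _grow(X, np.arange(n), leaf_size, rng, leaves)
        for leaf in leaves:
            K[np.ix_(leaf, leaf)] += 1.0
    return K / n_trees
```

The resulting matrix is symmetric with unit diagonal and entries in [0, 1], so it can be passed to any clustering routine that accepts a precomputed similarity matrix.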
Spectral clustering, a graph partitioning technique, has gained immense popularity in machine learning in the context of unsupervised learning. This is due to convincing empirical studies, the elegance of the approaches involved and the theoretical guarantees provided in the literature. Recently, challenging problems in computer vision and other areas have created a need for spectral methods that incorporate multi-way similarity measures. This, in turn, leads to a hypergraph partitioning problem. In this paper, we formulate a criterion for partitioning uniform hypergraphs, and show that a relaxation of this problem is related to the multilinear singular value decomposition (SVD) of symmetric tensors. Using this, we provide a spectral technique for clustering based on higher-order affinities, and derive a theoretical bound on the error incurred by this method. We also study the complexity of the algorithm and use Nyström's method and column sampling techniques to develop approximate methods with significantly reduced complexity. Experiments on geometric grouping and motion segmentation demonstrate the practical significance of the proposed methods.
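To make the link between higher-order affinities and the multilinear SVD concrete, here is a minimal sketch for a 3-uniform hypergraph. It assumes a dense symmetric affinity tensor `A` of shape (n, n, n), takes the leading left singular vectors of its mode-1 unfolding (the factor a higher-order SVD would compute), and clusters their rows with a small k-means. The paper's exact partitioning criterion, normalization, and sampling-based approximations are not reproduced, and the function names are hypothetical.

```python
import numpy as np

def kmeans(X, k, n_iter=50):
    """Lloyd's algorithm with deterministic farthest-point initialization."""
    centers = [X[0]]
    for _ in range(k - 1):
        d = np.min([np.sum((X - c) ** 2, axis=1) for c in centers], axis=0)
        centers.append(X[np.argmax(d)])
    C = np.array(centers)
    for _ in range(n_iter):
        labels = np.argmin(((X[:, None, :] - C[None]) ** 2).sum(-1), axis=1)
        C = np.array([X[labels == j].mean(0) if np.any(labels == j) else C[j]
                      for j in range(k)])
    return labels

def hypergraph_spectral_clusters(A, k):
    """Cluster n items from a symmetric n x n x n affinity tensor A."""
    n = A.shape[0]
    M = A.reshape(n, -1)  # mode-1 unfolding, shape (n, n*n)
    U, _, _ = np.linalg.svd(M, full_matrices=False)
    return kmeans(U[:, :k], k)  # rows of the top-k left singular vectors
```

For n items the unfolding has n² columns, which is the cost that sampling schemes such as Nyström's method are meant to reduce.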
Several important applications, such as streaming PCA and semidefinite programming, involve a large-scale positive-semidefinite (psd) matrix that is presented as a sequence of linear updates. Because of storage limitations, it may only be possible to retain a sketch of the psd matrix. This paper develops a new algorithm for fixed-rank psd approximation from a sketch. The approach combines the Nyström approximation with a novel mechanism for rank truncation. Theoretical analysis establishes that the proposed method can achieve any prescribed relative error in the Schatten 1-norm and that it exploits the spectral decay of the input matrix.
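The combination of a Nyström approximation with rank truncation can be sketched as follows. This is an illustration under assumptions rather than the paper's exact procedure: the sketch is taken to be Y = AΩ for a Gaussian test matrix Ω, the matrix arrives as additive updates, and a small shift `nu` (whose specific value here is an assumed choice) keeps the Cholesky factorization stable. The names `PsdSketch` and `fixed_rank_psd` are hypothetical.

```python
import numpy as np

class PsdSketch:
    """Maintains Y = A @ Omega while A is revealed as additive updates."""
    def __init__(self, n, k, seed=0):
        rng = np.random.default_rng(seed)
        self.Omega = rng.standard_normal((n, k))  # fixed random test matrix
        self.Y = np.zeros((n, k))

    def update(self, H):
        """Apply the linear update A <- A + H without storing A."""
        self.Y += H @ self.Omega

def fixed_rank_psd(Y, Omega, r):
    """Return (U, lam) with A approximated by U @ diag(lam) @ U.T, rank <= r."""
    n = Y.shape[0]
    nu = np.sqrt(n) * np.spacing(np.linalg.norm(Y, 2))  # assumed shift size
    Y_nu = Y + nu * Omega               # sketch of A + nu * I
    B = Omega.T @ Y_nu
    L = np.linalg.cholesky((B + B.T) / 2)  # B = L @ L.T
    E = np.linalg.solve(L, Y_nu.T).T       # E @ E.T is the Nystrom approximation
    U, s, _ = np.linalg.svd(E, full_matrices=False)
    U, s = U[:, :r], s[:r]                 # truncate to rank r
    lam = np.maximum(s ** 2 - nu, 0.0)     # undo the shift, keep psd
    return U, lam
```

Truncating the factor E instead of the reconstructed n × n matrix means the full approximation is never formed, so storage stays at the size of the sketch.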