Graph Approximation and Clustering on a Budget

Fetaya, Ethan, Shamir, Ohad, Ullman, Shimon

arXiv.org Machine Learning 

Shimon Ullman Weizmann Institute of Science shimon.ullman@weizmann.ac.il Abstract We consider the problem of learning from a similarity matrix (such as spectral clustering and low-dimensional embedding), when computing pairwise similarities are costly, and only a limited number of entries can be observed. We provide a theoretical analysis using standard notions of graph approximation, significantly generalizing previous results (which focused on spectral clustering with two clusters). We also propose a new algorithmic approach based on adaptive sampling, which experimentally matches or improves on previous methods, while being considerably more general and computationally cheaper. 1 Introduction Many unsupervised learning algorithms, such as spectral clustering [18], [2] and low-dimensional embedding via Laplacian eigenmaps and diffusion maps [3],[16], need as input a matrix of pairwise similaritiesW between the different objects in our data. In some cases, obtaining the full matrix can be a costly matter. For example, w ij may be based on some expensive-to-compute metric such as W2D [5]; based on some physical measurement (such as in some computational biology applications); or is given by a human annotator. In such cases, we would like to have a good approximation of the (initially unknown) matrix, while querying only a limited number of entries. An alternative but equivalent viewpoint is the problem of approximating an unknown weighted undirected graph, by querying a limited number of edges. This question has received previous attention in works such as [17] and [9], which focus on the task of spectral clustering into two clusters, and assuming two such distinct clusters indeed exist (i.e. that there is a big gap between the second and third eigenvalues of the Laplacian matrix). In this work we consider, both theoretically and algorithmically, the question of query-based graph approximation more generally, obtaining results relevant beyond two clusters and beyond spectral clustering. When considering graph approximations, the first question is what notion of approximation to consider. One important notion is cut approximation [14] where we wish for every cut in the approximated graph to have weight close to the weight of the cut in the original graph up to a multiplicative factor. Many machine learning algorithms (and many more general algorithms) such as cut based clustering [11], energy minimization [22], etc. [1] are based on cuts, so this notion of approximation is natural for these uses.

Duplicate Docs Excel Report

Title
None found

Similar Docs  Excel Report  more

TitleSimilaritySource
None found