A Simple Approach to Sparse Clustering

Arias-Castro, Ery, Pu, Xiao

arXiv.org Machine Learning 

Consider the problem of sparse clustering, where it is assumed that only a subset of the features are useful for clustering purposes. In the framework of the COSA method of Friedman and Meulman, subsequently improved in the form of the Sparse K-means method of Witten and Tibshirani, a natural and simpler hill-climbing approach is introduced. The new method is shown to be competitive with these two methods and others. Keywords: Sparse Clustering, Hill-climbing, High-dimensional, Feature Selection 1. Introduction Consider a typical setting for clusteringn items based on pairwise dissimilarities, withδ(i,j) denoting the dissimilarity between itemsi,j [n ] {1,...,n } . For concreteness, we assume thatδ(i,j) 0 and δ(i,i) 0 for all i,j [n ] . In principle, if we want to delineateκ clusters, the goal is (for example) to minimize the average within-cluster dissimilarity. Let C n κ denote the class of clusterings ofn items intoκ groups. For C C n κ, its average within-cluster dissimilarity is defined as [C ] k [κ ] 1 C 1 (k) i,j C 1 (k)δ(i,j). If under the Euclidean setting, we further define cluster centers µ k 1 n i C 1 (k)x i with k [κ ], (2) then the within-cluster dissimilarity can be rewritten as follows, [C ] k [κ ] 1 C 1 (k) i,j C 1 (k) x i x j 2 k [κ ] i C 1 (k) x i µ k 2 . The resulting optimization problem is the following: Given (δ(i,j) i,j [n ]), minimize [C ] over C C n κ .

Duplicate Docs Excel Report

Title
None found

Similar Docs  Excel Report  more

TitleSimilaritySource
None found