CNAK : Cluster Number Assisted K-means

Saha, Jayasree, Mukherjee, Jayanta

arXiv.org Machine Learning 

Determining the number of clusters present in a dataset is an important problem in cluster analysis. Conventional clustering techniques generally assume this parameter to be provided up front. In this paper, we propose a method which analyzes cluster stability for predicting the cluster number. Under the same computational framework, the technique also finds representatives of the clusters. The method is apt for handling big data, as we design the algorithm using Monte-Carlo simulation. Also, we explore a few pertinent issues found to be of also clustering. Experiments reveal that the proposed method is capable of identifying a single cluster. It is robust in handling high dimensional dataset and performs reasonably well over datasets having cluster imbalance. Moreover, it can indicate cluster hierarchy, if present. Overall we have observed significant improvement in speed and quality for predicting cluster numbers as well as the composition of clusters in a large dataset. Keywords: k-means clustering, Bipartite graph, Perfect Matching, Kuhn-Munkres Algorithm, Monte Carlo simulation. 1. Introduction In cluster analysis, it is required to group a set of data points in a multidimensional space, so that data points in the same group are more similar to each other than to those in other groups. These groups are called clusters. Various distance functions may be used to compute the degree of similarity or dissimilarity among these data points. Typically Euclidean distance function is widely used in clustering. The aim of this unsupervised technique is to increase homogeneity in a group and heterogeneity between groups. Several clustering methods with different characteristics have been proposed for different purposes. Some well-known methods include partition-based clustering [26], hierarchical clustering [25], spectral clustering [27], density-based clustering [12]. However, they require the knowledge of cluster number for a given dataset a priori [12, 21, 26, 27, 36].

Duplicate Docs Excel Report

Title
None found

Similar Docs  Excel Report  more

TitleSimilaritySource
None found