Benchmarking of Clustering Validity Measures Revisited

Simpson, Connor, Campello, Ricardo J. G. B., Stojanovski, Elizabeth

arXiv.org Machine Learning 

Clustering is an unsupervised learning technique that aims to identify groups of similar or interrelated observations within data [39, 87]. Clustering algorithms are commonly categorised into three primary groups [39, 82]: partitioning algorithms such as K-Means [39] and Spectral Clustering [88], hierarchical algorithms such as Single Linkage [39] and HDBSCAN* [7, 8], and soft (fuzzy or probabilistic) algorithms such as Fuzzy c-Means (FCM) [4] and Expectation Maximisation with Gaussian Mixture Models (EM-GMM) [20].

Partitioning algorithms divide the data into a given number k of clusters, while hierarchical algorithms produce a sequence of nested partitions with incrementally varying numbers of clusters. Soft clustering algorithms resemble partitioning techniques, except that each observation is assigned a degree of membership or probability for each cluster, rather than a hard assignment to a single cluster.

Within these categories there are also algorithms that may not assign every observation to a cluster, owing to outlier trimming or noise detection; two examples are trimmed K-means [14] and the previously mentioned HDBSCAN*, each of which may produce solutions in which some observations remain unassigned.

Clustering validation, or validity, is an important step of the clustering process irrespective of the algorithm used [39, 25], as it is crucial for determining the best partition(s) produced and the number of clusters in the data [23].
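To make the partitioning category concrete, the following is a minimal sketch of Lloyd's algorithm for K-Means, in which every observation receives a hard assignment to exactly one of k clusters. This is an illustrative stdlib-only implementation, not the code used in any of the cited benchmarks; the function name `kmeans` and its parameters are our own.

```python
# Minimal sketch of Lloyd's algorithm for K-Means (a partitioning method):
# every observation is hard-assigned to exactly one of k clusters.
import random

def kmeans(points, k, iters=100, seed=0):
    rng = random.Random(seed)
    # Initialise centroids with k distinct observations chosen at random.
    centroids = rng.sample(points, k)
    labels = [0] * len(points)
    for _ in range(iters):
        # Assignment step: each point joins the cluster of its nearest
        # centroid (squared Euclidean distance).
        labels = [
            min(range(k),
                key=lambda j: sum((p - c) ** 2
                                  for p, c in zip(pt, centroids[j])))
            for pt in points
        ]
        # Update step: each centroid moves to the mean of its members.
        new_centroids = []
        for j in range(k):
            members = [pt for pt, lab in zip(points, labels) if lab == j]
            if members:
                new_centroids.append(tuple(sum(dim) / len(members)
                                           for dim in zip(*members)))
            else:  # keep the old centroid if a cluster empties out
                new_centroids.append(centroids[j])
        if new_centroids == centroids:  # converged: partition is stable
            break
        centroids = new_centroids
    return labels, centroids

# Two well-separated 2-D blobs: K-Means recovers the intended partition.
data = [(0.0, 0.0), (0.2, 0.1), (0.1, 0.3),
        (5.0, 5.0), (5.2, 4.9), (4.8, 5.1)]
labels, centroids = kmeans(data, k=2)
```

A soft algorithm such as FCM or EM-GMM would instead return, for each observation, a membership weight or posterior probability per cluster that sums to one, rather than the single integer label per point produced here.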