Reviews: Affinity Clustering: Hierarchical Clustering at Scale

Neural Information Processing Systems 

The paper contributes to the field of distributed hierarchical clustering. The authors propose a novel class of algorithms, termed 'affinity clustering', that builds on Boruvka's seminal work on minimum spanning trees, and contrast it with linkage clustering algorithms (which are based on Kruskal's work). The authors systematically introduce the theoretical underpinnings of affinity clustering, before proposing 'certificates' as a metric to characterise the solutions of clustering algorithms more generally by assessing the weights (cost) of the clustered edges. Following the theoretical analysis and the formulation of MapReduce variants of affinity clustering for distributed operation, the quality is assessed empirically on standard datasets using variants of linkage- and affinity-based algorithms, as well as k-means. In addition to the Rand index (as a metric for clustering accuracy), the quality of the algorithms is assessed based on the size ratio of the detected clusters (with balanced cluster sizes considered favourable).
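To make the contrast with Kruskal-style linkage concrete, the following is a minimal single-machine sketch of one Boruvka-style merge round, in which every current cluster simultaneously picks its cheapest outgoing edge and clusters are merged along the chosen edges. It is an illustration under stated assumptions, not the authors' MapReduce implementation: the function names `boruvka_round` and `_find`, the edge-list representation, and the tie-breaking rule are all hypothetical choices made for this example.

```python
from typing import Dict, List, Tuple

Edge = Tuple[int, int, float]  # (u, v, weight); hypothetical representation


def _find(parent: List[int], x: int) -> int:
    """Union-find lookup with path halving."""
    while parent[x] != x:
        parent[x] = parent[parent[x]]
        x = parent[x]
    return x


def boruvka_round(n: int, edges: List[Edge], labels: List[int]) -> List[int]:
    """Run one Boruvka-style round on n vertices.

    labels[i] is the current cluster id of vertex i (ids lie in range(n)).
    Returns the new labels after every cluster merges along its
    cheapest outgoing edge.
    """
    # Step 1: each cluster records its minimum-weight outgoing edge.
    best: Dict[int, Edge] = {}
    for u, v, w in edges:
        cu, cv = labels[u], labels[v]
        if cu == cv:
            continue  # edge is internal to a cluster
        for c in (cu, cv):
            # Ties are broken by endpoint ids (an assumption of this sketch).
            if c not in best or (w, u, v) < (best[c][2], best[c][0], best[c][1]):
                best[c] = (u, v, w)

    # Step 2: merge clusters along all selected edges at once.
    parent = list(range(n))
    for u, v, _ in best.values():
        ru, rv = _find(parent, labels[u]), _find(parent, labels[v])
        if ru != rv:
            parent[max(ru, rv)] = min(ru, rv)  # smallest id becomes the root

    return [_find(parent, labels[i]) for i in range(n)]


# Toy path graph with six vertices (made-up weights).
edges = [(0, 1, 1.0), (1, 2, 5.0), (2, 3, 1.5), (3, 4, 6.0), (4, 5, 2.0)]
level1 = boruvka_round(6, edges, list(range(6)))
level2 = boruvka_round(6, edges, level1)
print(level1)  # [0, 0, 2, 2, 4, 4]
print(level2)  # [0, 0, 0, 0, 0, 0]
```

Each round yields one level of the hierarchy, and because every cluster merges with at least one other, the number of clusters at least halves per round on a connected graph. A Kruskal-style linkage procedure would instead add one edge at a time in increasing order of weight.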