Simple, Scalable and Effective Clustering via One-Dimensional Projections

Neural Information Processing Systems 

Clustering is a fundamental problem in unsupervised machine learning with many applications in data analysis. Popular clustering algorithms such as Lloyd's algorithm and k -means can take \Omega(ndk) time when clustering n points in a d -dimensional space (represented by an n\times d matrix X) into k clusters. On massive datasets with moderate to large k, the multiplicative k factor can become very expensive. We introduce a simple randomized clustering algorithm that provably runs in expected time O(\mathsf{nnz}(X) n\log n) for arbitrary k . Here \mathsf{nnz}(X) is the total number of non-zero entries in the input dataset X, which is upper bounded by nd and can be significantly smaller for sparse datasets.