Unsupervised Feature Selection for the k -means Clustering Problem

Neural Information Processing Systems 

We present a novel feature selection algorithm for the k -means clustering problem. Our algorithm is randomized and, assuming an accuracy parameter \epsilon \in (0,1), selects and appropriately rescales in an unsupervised manner \Theta(k \log(k / \epsilon) / \epsilon 2) features from a dataset of arbitrary dimensions. We prove that, if we run any \gamma -approximate k -means algorithm ( \gamma \geq 1) on the features selected using our method, we can find a (1 (1 \epsilon)\gamma) -approximate partition with high probability.