k-Means Clustering of Lines for Big Data

Neural Information Processing Systems 

The input to the \emph{$k$-means for lines} problem is a set $L$ of $n$ lines in $\mathbb{R}^d$, and the goal is to compute a set of $k$ centers (points) in $\mathbb{R}^d$ that minimizes the sum, over every line in $L$, of the squared distance between the line and its nearest center. This is a straightforward generalization of the $k$-means problem, where the input is a set of $n$ points instead of lines. We suggest the first PTAS that computes a $(1+\epsilon)$-approximation to this problem in time $O(n \log n)$ for any constant approximation error $\epsilon \in (0, 1)$ and constant integers $k, d \geq 1$. This is achieved by proving that there is always a weighted subset (called a coreset) of $dk^{O(k)}\log(n)/\epsilon^2$ lines in $L$ that approximates the sum of squared distances from $L$ to \emph{any} given set of $k$ points. Using the traditional merge-and-reduce technique, this coreset implies results for a streaming set (possibly infinite) of lines to $M$ machines in one pass (e.g.
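To make the objective concrete, the following is a minimal sketch of the $k$-means-for-lines cost: each line is parameterized as $\{a + t\,u\}$ with unit direction $u$, and the cost sums, over lines, the squared Euclidean distance to the nearest center. The function names are illustrative, not from the paper.

```python
import numpy as np

def line_point_dist_sq(a, u, p):
    """Squared Euclidean distance from point p to the line {a + t*u},
    where u is assumed to be a unit direction vector."""
    v = p - a
    proj = v - np.dot(v, u) * u  # component of v orthogonal to the line
    return float(np.dot(proj, proj))

def k_means_for_lines_cost(lines, centers):
    """Sum over all lines of the squared distance to the nearest center.
    `lines` is a list of (a, u) pairs; `centers` is a list of points."""
    return sum(min(line_point_dist_sq(a, u, c) for c in centers)
               for a, u in lines)

# Example: the x-axis in R^2 against two candidate centers.
a = np.array([0.0, 0.0])
u = np.array([1.0, 0.0])
cost = k_means_for_lines_cost([(a, u)], [np.array([0.0, 3.0]),
                                         np.array([5.0, 1.0])])
# The nearest center (5, 1) lies at distance 1 from the x-axis, so cost = 1.
```

Note that, unlike the point case, the distance here is between a center and the closest point on an (infinite) line, which is why the orthogonal projection appears.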
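The merge-and-reduce data flow mentioned above can be sketched schematically as follows. This is only an illustration of the streaming tree structure: `build_coreset` below is a uniform-sampling placeholder with reweighting, standing in for the paper's actual coreset construction, and all names are assumptions.

```python
import numpy as np

def build_coreset(weighted_lines, size, rng):
    """Placeholder reduce step: uniform sampling with reweighting.
    (A stand-in for the paper's construction, used only to show data flow.)
    Items are (a, u, w) triples: anchor, unit direction, weight."""
    if len(weighted_lines) <= size:
        return list(weighted_lines)
    idx = rng.choice(len(weighted_lines), size=size, replace=False)
    scale = len(weighted_lines) / size  # preserve total weight in expectation
    return [(a, u, w * scale) for a, u, w in (weighted_lines[i] for i in idx)]

def merge_and_reduce_stream(stream, leaf_size, rng):
    """Classic merge-and-reduce tree over a stream of weighted lines:
    keep at most one coreset per tree level; whenever two coresets meet
    at the same level, merge them and reduce back to `leaf_size`."""
    levels = {}   # tree level -> coreset currently stored at that level
    buffer = []   # raw lines not yet forming a full leaf

    def push(coreset, level):
        while level in levels:
            coreset = build_coreset(levels.pop(level) + coreset,
                                    leaf_size, rng)
            level += 1
        levels[level] = coreset

    for line in stream:
        buffer.append(line)
        if len(buffer) == leaf_size:
            push(buffer, 0)
            buffer = []

    # Flush: merge the leftover buffer with all stored levels.
    result = buffer
    for coreset in levels.values():
        result = result + coreset
    return build_coreset(result, leaf_size, rng)
```

The key property this structure provides is that only $O(\log n)$ coresets are held in memory at any time, which is what yields the near-logarithmic memory and update time per machine in the streaming setting.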