k-Means Clustering of Lines for Big Data

Marom, Yair, Feldman, Dan

Neural Information Processing Systems 

The input to the \emph{$k$-mean for lines} problem is a set $L$ of $n$ lines in $\mathbb{R} d$, and the goal is to compute a set of $k$ centers (points) in $\mathbb{R} d$ that minimizes the sum of squared distances over every line in $L$ and its nearest center. This is a straightforward generalization of the $k$-mean problem where the input is a set of $n$ points instead of lines. We suggest the first PTAS that computes a $(1 \epsilon)$-approximation to this problem in time $O(n \log n)$ for any constant approximation error $\epsilon \in (0, 1)$, and constant integers $k, d \geq 1$. This is by proving that there is always a weighted subset (called coreset) of $dk {O(k)}\log (n)/\epsilon 2$ lines in $L$ that approximates the sum of squared distances from $L$ to \emph{any} given set of $k$ points. Using traditional merge-and-reduce technique, this coreset implies results for a streaming set (possibly infinite) of lines to $M$ machines in one pass (e.g.