Collaborating Authors

 Dan Feldman


k-Means Clustering of Lines for Big Data

Neural Information Processing Systems

The k-means for lines problem is a straightforward generalization of the classical k-means problem in which the input is a set of n lines instead of points. We suggest the first PTAS that computes a (1 + ε)-approximation to this problem in time O(n log n) for any constant approximation error ε ∈ (0, 1) and constant integers k, d ≥ 1.
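As a concrete illustration of the objective, below is a minimal Lloyd-style local-search sketch in Python (our own heuristic for intuition, not the paper's PTAS; the function names are hypothetical). Each line is represented by a point on it and a unit direction; the best single center for a group of lines is found by solving a small linear system.

import numpy as np

def point_line_sqdist(c, p, v):
    # Squared distance from center c to the line {p + t*v}: the squared
    # norm of the component of (c - p) orthogonal to the unit direction v.
    r = c - p
    r_perp = r - np.dot(r, v) * v
    return float(np.dot(r_perp, r_perp))

def kmeans_lines(points, dirs, k, iters=50, seed=0):
    # points[i] is a point on line i, dirs[i] its unit direction.
    points = np.asarray(points, float)
    dirs = np.asarray(dirs, float)
    rng = np.random.default_rng(seed)
    n, d = points.shape
    centers = points[rng.choice(n, size=k, replace=False)]
    for _ in range(iters):
        # Assignment step: each line joins its nearest center.
        labels = np.array([
            min(range(k), key=lambda j: point_line_sqdist(centers[j], p, v))
            for p, v in zip(points, dirs)
        ])
        # Update step: the 1-mean of a set of lines satisfies the normal
        # equations (sum_i P_i) c = sum_i P_i p_i with P_i = I - v_i v_i^T.
        for j in range(k):
            idx = np.where(labels == j)[0]
            if idx.size == 0:
                continue
            A, b = np.zeros((d, d)), np.zeros(d)
            for i in idx:
                P = np.eye(d) - np.outer(dirs[i], dirs[i])
                A += P
                b += P @ points[i]
            centers[j] = np.linalg.lstsq(A, b, rcond=None)[0]
    return centers, labels

Like Lloyd's algorithm for points, this only reaches a local optimum; the paper's contribution is a (1 + ε) guarantee, which this sketch does not provide.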


Fast and Accurate Least-Mean-Squares Solvers

Neural Information Processing Systems

Least-mean-squares (LMS) solvers such as Linear / Ridge / Lasso-Regression, SVD and Elastic-Net not only solve fundamental machine learning problems, but are also the building blocks in a variety of other methods, such as decision trees and matrix factorizations. We suggest an algorithm that receives a finite set of n d-dimensional real vectors and returns a weighted subset of d + 1 vectors whose weighted sum is exactly the same. The proof of Caratheodory's Theorem (1907) computes such a subset in O(n²d²) time and is therefore not used in practice.
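The subset in question is the one guaranteed by Caratheodory's Theorem. Below is a minimal NumPy rendering of the classical construction (our own illustrative code, not the paper's accelerated solver): each round finds a linear dependence among the points and shifts the weights along it until one weight reaches zero, leaving the weighted sum unchanged.

import numpy as np

def caratheodory(P, w):
    # P: (n, d) points, w: (n,) positive weights.
    # Returns at most d + 1 of the points with new nonnegative weights
    # whose weighted sum equals w @ P exactly (up to float error).
    P, w = np.asarray(P, float), np.asarray(w, float)
    while P.shape[0] > P.shape[1] + 1:
        n = P.shape[0]
        # Any null-space vector of the differences P_i - P_0 (i >= 1)
        # extends to v with sum(v) = 0 and sum_i v_i * P_i = 0.
        A = (P[1:] - P[0]).T                 # d x (n-1), so rank < n-1
        v = np.empty(n)
        v[1:] = np.linalg.svd(A)[2][-1]      # last right singular vector
        v[0] = -v[1:].sum()
        # Move the weights along v until the first one hits zero; the
        # two constraints above keep the weighted sum fixed.
        pos = v > 1e-12
        alpha = np.min(w[pos] / v[pos])
        w = w - alpha * v
        keep = w > 1e-12
        P, w = P[keep], w[keep]
    return P, w

rng = np.random.default_rng(1)
P = rng.random((100, 3))
Q, u = caratheodory(P, np.ones(100))
print(len(Q), np.allclose(P.sum(axis=0), u @ Q))   # at most d + 1 = 4, True

This is the slow textbook routine the abstract contrasts against; the paper's point is a booster that achieves the same exact guarantee much faster.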


Dimensionality Reduction of Massive Sparse Datasets Using Coresets

Neural Information Processing Systems

In this paper we present a practical solution with performance guarantees to the problem of dimensionality reduction for very large scale sparse matrices. We show applications of our approach to computing the Principal Component Analysis (PCA) of any n × d matrix, using one pass over the stream of its rows. Our solution uses coresets: a scaled subset of the n rows that approximates their sum of squared distances to every k-dimensional affine subspace. An open theoretical problem has been to compute such a coreset whose size is independent of both n and d. An open practical problem has been to compute a non-trivial approximation to the PCA of very large but sparse databases such as the Wikipedia document-term matrix in a reasonable time. We answer both of these questions affirmatively. Our main technical result is a new framework for deterministic coreset constructions based on a reduction to the problem of counting items in a stream.
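To make the coreset guarantee concrete, here is a minimal sketch, assuming a simple norm-squared importance-sampling baseline rather than the paper's deterministic construction: rows are sampled with probability proportional to their squared norm and reweighted, so the weighted sum of squared distances to a fixed k-dimensional subspace (through the origin, for simplicity) is an unbiased estimate of the true cost.

import numpy as np

def sq_dist_cost(A, V, w=None):
    # Sum of (weighted) squared distances of the rows of A to the
    # column span of V, where V is a (d, k) orthonormal basis:
    # dist^2(a, span V) = |a|^2 - |a V|^2.
    w = np.ones(len(A)) if w is None else w
    return float(w @ (np.sum(A**2, axis=1) - np.sum((A @ V)**2, axis=1)))

rng = np.random.default_rng(0)
n, d, k, m = 20000, 50, 5, 500
A = rng.normal(size=(n, d))

# Sample m rows with probability proportional to squared norm, and
# weight each sample by 1/(m * p_i) so the cost estimate is unbiased.
p = np.sum(A**2, axis=1)
p /= p.sum()
idx = rng.choice(n, size=m, p=p)
C, w = A[idx], 1.0 / (m * p[idx])

V = np.linalg.qr(rng.normal(size=(d, k)))[0]      # a random k-dim subspace
print(sq_dist_cost(A, V), sq_dist_cost(C, V, w))  # the two should be close

A real coreset must hold simultaneously for every k-dimensional affine subspace, which is what the paper's deterministic, one-pass construction provides; the random baseline above only conveys the flavor of the guarantee.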

