Practical Data-Dependent Metric Compression with Provable Guarantees

Piotr Indyk, Ilya Razenshteyn, Tal Wagner

Neural Information Processing Systems 

We introduce a new distance-preserving compact representation of multidimensional point-sets. Given n points in a d-dimensional space where each coordinate is represented using B bits (i.e., dB bits per point), it produces a representation of size O(d log(dB/ɛ) + log n) bits per point from which one can approximate the distances up to a factor of 1 ɛ. Our algorithm almost matches the recent bound of [6] while being much simpler. We compare our algorithm to Product Quantization (PQ) [7], a state of the art heuristic metric compression method. We evaluate both algorithms on several data sets: SIFT (used in [7]), MNIST [11], New York City taxi time series [4] and a synthetic one-dimensional data set embedded in a high-dimensional space. With appropriately tuned parameters, our algorithm produces representations that are comparable to or better than those produced by PQ, while having provable guarantees on its performance.