Lucic, Mario, Faulkner, Matthew, Krause, Andreas, Feldman, Dan

How can we train a statistical mixture model on a massive data set? In this work we show how to construct coresets for mixtures of Gaussians. A coreset is a weighted subset of the data, which guarantees that models fitting the coreset also provide a good fit for the original data set. We show that, perhaps surprisingly, Gaussian mixtures admit coresets of size polynomial in dimension and the number of mixture components, while being independent of the data set size. Hence, one can harness computationally intensive algorithms to compute a good approximation on a significantly smaller data set. More importantly, such coresets can be efficiently constructed both in distributed and streaming settings and do not impose restrictions on the data generating process. Our results rely on a novel reduction of statistical estimation to problems in computational geometry and new combinatorial complexity results for mixtures of Gaussians. Empirical evaluation on several real-world datasets suggests that our coreset-based approach enables significant reduction in training-time with negligible approximation error.

Bachem, Olivier, Lucic, Mario, Krause, Andreas

We investigate coresets - succinct, small summaries of large data sets - so that solutions found on the summary are provably competitive with solution found on the full data set. We provide an overview over the state-of-the-art in coreset construction for machine learning. In Section 2, we present both the intuition behind and a theoretically sound framework to construct coresets for general problems and apply it to $k$-means clustering. In Section 3 we summarize existing coreset construction algorithms for a variety of machine learning problems such as maximum likelihood estimation of mixture models, Bayesian non-parametric models, principal component analysis, regression and general empirical risk minimization.

Feldman, Dan, Kfir, Zahi, Wu, Xuan

An $\varepsilon$-coreset for a given set $D$ of $n$ points, is usually a small weighted set, such that querying the coreset \emph{provably} yields a $(1+\varepsilon)$-factor approximation to the original (full) dataset, for a given family of queries. Using existing techniques, coresets can be maintained for streaming, dynamic (insertion/deletions), and distributed data in parallel, e.g. on a network, GPU or cloud. We suggest the first coresets that approximate the negative log-likelihood for $k$-Gaussians Mixture Models (GMM) of arbitrary shapes (ratio between eigenvalues of their covariance matrices). For example, for any input set $D$ whose coordinates are integers in $[-n^{100},n^{100}]$ and any fixed $k,d\geq 1$, the coreset size is $(\log n)^{O(1)}/\varepsilon^2$, and can be computed in time near-linear in $n$, with high probability. The optimal GMM may then be approximated quickly by learning the small coreset. Previous results [NIPS'11, JMLR'18] suggested such small coresets for the case of semi-speherical unit Gaussians, i.e., where their corresponding eigenvalues are constants between $\frac{1}{2\pi}$ to $2\pi$. Our main technique is a reduction between coresets for $k$-GMMs and projective clustering problems. We implemented our algorithms, and provide open code, and experimental results. Since our coresets are generic, with no special dependency on GMMs, we hope that they will be useful for many other functions.

Bachem, Olivier, Lucic, Mario, Krause, Andreas

Coresets are compact representations of data sets such that models trained on a coreset are provably competitive with models trained on the full data set. As such, they have been successfully used to scale up clustering models to massive data sets. While existing approaches generally only allow for multiplicative approximation errors, we propose a novel notion of coresets called lightweight coresets that allows for both multiplicative and additive errors. We provide a single algorithm to construct light-weight coresets for k-Means clustering, Bregman clustering and maximum likelihood estimation of Gaussian mixture models. The algorithm is substantially faster than existing constructions, embarrassingly parallel and resulting coresets are smaller. In an extensive experimental evaluation, we demonstrate that the proposed method outperforms existing coreset constructions.

Huggins, Jonathan, Campbell, Trevor, Broderick, Tamara

The use of Bayesian methods in large-scale data settings is attractive because of the rich hierarchical models, uncertainty quantification, and prior specification they provide. Standard Bayesian inference algorithms are computationally expensive, however, making their direct application to large datasets difficult or infeasible. Recent work on scaling Bayesian inference has focused on modifying the underlying algorithms to, for example, use only a random data subsample at each iteration. We leverage the insight that data is often redundant to instead obtain a weighted subset of the data (called a coreset) that is much smaller than the original dataset. We can then use this small coreset in any number of existing posterior inference algorithms without modification. In this paper, we develop an efficient coreset construction algorithm for Bayesian logistic regression models. We provide theoretical guarantees on the size and approximation quality of the coreset -- both for fixed, known datasets, and in expectation for a wide class of data generative models. Crucially, the proposed approach also permits efficient construction of the coreset in both streaming and parallel settings, with minimal additional effort. We demonstrate the efficacy of our approach on a number of synthetic and real-world datasets, and find that, in practice, the size of the coreset is independent of the original dataset size. Furthermore, constructing the coreset takes a negligible amount of time compared to that required to run MCMC on it.