Scalable Training of Mixture Models via Coresets

How can we train a statistical mixture model on a massive data set? In this paper, we show how to construct coresets for mixtures of Gaussians and natural generalizations. A coreset is a weighted subset of the data, which guarantees that models fitting the coreset will also provide a good fit for the original data set. We show that, perhaps surprisingly, Gaussian mixtures admit coresets of size independent of the size of the data set. More precisely, we prove that a weighted set of $O(dk^3/\eps^2)$ data points suffices for computing a $(1+\eps)$-approximation for the optimal model on the original $n$ data points. Moreover, such coresets can be efficiently constructed in a map-reduce style computation, as well as in a streaming setting. Our results rely on a novel reduction of statistical estimation to problems in computational geometry, as well as new complexity results about mixtures of Gaussians. We empirically evaluate our algorithms on several real data sets, including a density estimation problem in the context of earthquake detection using accelerometers in mobile phones.

Training Gaussian Mixture Models at Scale via Coresets

How can we train a statistical mixture model on a massive data set? In this work we show how to construct coresets for mixtures of Gaussians. A coreset is a weighted subset of the data, which guarantees that models fitting the coreset also provide a good fit for the original data set. We show that, perhaps surprisingly, Gaussian mixtures admit coresets of size polynomial in dimension and the number of mixture components, while being independent of the data set size. Hence, one can harness computationally intensive algorithms to compute a good approximation on a significantly smaller data set. More importantly, such coresets can be efficiently constructed both in distributed and streaming settings and do not impose restrictions on the data generating process. Our results rely on a novel reduction of statistical estimation to problems in computational geometry and new combinatorial complexity results for mixtures of Gaussians. Empirical evaluation on several real-world datasets suggests that our coreset-based approach enables significant reduction in training-time with negligible approximation error.

Practical Coreset Constructions for Machine Learning

We investigate coresets - succinct, small summaries of large data sets - so that solutions found on the summary are provably competitive with solution found on the full data set. We provide an overview over the state-of-the-art in coreset construction for machine learning. In Section 2, we present both the intuition behind and a theoretically sound framework to construct coresets for general problems and apply it to $k$-means clustering. In Section 3 we summarize existing coreset construction algorithms for a variety of machine learning problems such as maximum likelihood estimation of mixture models, Bayesian non-parametric models, principal component analysis, regression and general empirical risk minimization.

Coresets for Gaussian Mixture Models of Any Shape

An $\varepsilon$-coreset for a given set $D$ of $n$ points, is usually a small weighted set, such that querying the coreset \emph{provably} yields a $(1+\varepsilon)$-factor approximation to the original (full) dataset, for a given family of queries. Using existing techniques, coresets can be maintained for streaming, dynamic (insertion/deletions), and distributed data in parallel, e.g. on a network, GPU or cloud. We suggest the first coresets that approximate the negative log-likelihood for $k$-Gaussians Mixture Models (GMM) of arbitrary shapes (ratio between eigenvalues of their covariance matrices). For example, for any input set $D$ whose coordinates are integers in $[-n^{100},n^{100}]$ and any fixed $k,d\geq 1$, the coreset size is $(\log n)^{O(1)}/\varepsilon^2$, and can be computed in time near-linear in $n$, with high probability. The optimal GMM may then be approximated quickly by learning the small coreset. Previous results [NIPS'11, JMLR'18] suggested such small coresets for the case of semi-speherical unit Gaussians, i.e., where their corresponding eigenvalues are constants between $\frac{1}{2\pi}$ to $2\pi$. Our main technique is a reduction between coresets for $k$-GMMs and projective clustering problems. We implemented our algorithms, and provide open code, and experimental results. Since our coresets are generic, with no special dependency on GMMs, we hope that they will be useful for many other functions.

On Activation Function Coresets for Network Pruning

Model compression provides a means to efficiently deploy deep neural networks (DNNs) on devices that limited computation resources and tight power budgets, such as mobile and IoT (Internet of Things) devices. Consequently, model compression is one of the most critical topics in modern deep learning. Typically, the state-of-the-art model compression methods suffer from a big limitation: they are only based on heuristics rather than theoretical foundation and thus offer no worst-case guarantees. To bridge this gap, Baykal et. al. [2018a] suggested using a coreset, a small weighted subset of the data that provably approximates the original data set, to sparsify the parameters of a trained fully-connected neural network by sampling a number of neural network parameters based on the importance of the data. However, the sampling procedure is data-dependent and can only be only be performed after an expensive training phase. We propose the use of data-independent coresets to perform provable model compression without the need for training. We first prove that there exists a coreset whose size is independent of the input size of the data for any neuron whose activation function is from a family of functions that includes variants of ReLU, sigmoid and others. We then provide a compression-based algorithm that constructs these coresets and explicitly applies neuron pruning for the underlying model. We demonstrate the effectiveness of our methods with experimental evaluations for both synthetic and real-world benchmark network compression. In particular, our framework provides up to 90% compression on the LeNet-300-100 architecture on MNIST and actually improves the accuracy.