coreset


Few-Shot Learning with fast.ai

#artificialintelligence

Lately, posts and tutorials about new deep learning architectures and training strategies have dominated the community. However, one very interesting research area, namely few-shot learning, is not getting the attention it deserves. If we want widespread adoption of ML, we need to find ways to train models efficiently, with little data and code. In this tutorial, we will go through a Google Colab notebook to train an image classification model using only 5 labeled samples per class. Learning from only 5 examples per class is also called 5-shot learning.
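
The tutorial's notebook is not reproduced here; the following is only a minimal sketch of the general idea, assuming the fastai v2 API (names such as vision_learner vary slightly across versions) and a hypothetical folder with five labeled images per class. Paths and hyperparameters are illustrative.

    # Minimal 5-shot fine-tuning sketch with fastai v2 (illustrative, not the tutorial's code).
    # Assumes data/ contains one sub-folder per class with 5 labeled images each.
    from fastai.vision.all import *

    path = Path("data")  # hypothetical dataset location
    dls = ImageDataLoaders.from_folder(
        path, valid_pct=0.2, item_tfms=Resize(224), bs=4  # tiny batch size for tiny data
    )
    learn = vision_learner(dls, resnet34, metrics=accuracy)  # pretrained backbone
    learn.fine_tune(10)  # a few epochs of transfer learning on the 5-shot data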


k-Means Clustering of Lines for Big Data

Neural Information Processing Systems

The input to the \emph{$k$-means for lines} problem is a set $L$ of $n$ lines in $\mathbb{R}^d$, and the goal is to compute a set of $k$ centers (points) in $\mathbb{R}^d$ that minimizes the sum of squared distances over every line in $L$ and its nearest center. This is a straightforward generalization of the $k$-means problem, where the input is a set of $n$ points instead of lines. We suggest the first PTAS that computes a $(1+\epsilon)$-approximation to this problem in time $O(n \log n)$ for any constant approximation error $\epsilon \in (0, 1)$ and constant integers $k, d \geq 1$. This is by proving that there is always a weighted subset (called a coreset) of $dk^{O(k)}\log(n)/\epsilon^2$ lines in $L$ that approximates the sum of squared distances from $L$ to \emph{any} given set of $k$ points. Using the traditional merge-and-reduce technique, this coreset implies results for a streaming set (possibly infinite) of lines to $M$ machines in one pass (e.g.
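
As a concrete reading of the objective, the sketch below (my own illustration, not the paper's algorithm) evaluates the sum of squared distances from a set of lines, each given by a point and a unit direction, to their nearest centers.

    import numpy as np

    def line_point_sqdist(a, u, c):
        """Squared distance from point c to the line {a + t*u}, with u a unit vector."""
        diff = c - a
        return float(np.dot(diff, diff) - np.dot(diff, u) ** 2)

    def k_means_for_lines_cost(lines, centers):
        """Sum over lines of the squared distance to the nearest center.
        lines: list of (point, unit direction) pairs; centers: array of shape (k, d)."""
        return sum(min(line_point_sqdist(a, u, c) for c in centers) for a, u in lines)

    # Tiny usage example in R^2 with k = 1.
    lines = [(np.array([0.0, 1.0]), np.array([1.0, 0.0]))]   # the horizontal line y = 1
    centers = np.array([[0.0, 0.0]])                         # single center at the origin
    print(k_means_for_lines_cost(lines, centers))            # 1.0 (distance 1, squared)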


Coresets for Clustering with Fairness Constraints

Neural Information Processing Systems

In a recent work, \cite{chierichetti2017fair} studied the following "fair" variants of classical clustering problems such as $k$-means and $k$-median: given a set of $n$ data points in $\mathbb{R}^d$ and a binary type associated to each data point, the goal is to cluster the points while ensuring that the proportion of each type in each cluster is roughly the same as its underlying proportion. Subsequent work has focused on either extending this setting to the case where each data point has multiple, non-disjoint sensitive types such as race and gender \cite{bera2019fair}, or on addressing the problem that the clustering algorithms in the above work do not scale well. The main contribution of this paper is an approach to clustering with fairness constraints that involve \emph{multiple, non-disjoint} attributes and that is \emph{also scalable}. Our approach is based on novel constructions of coresets: for the $k$-median objective, we construct an $\varepsilon$-coreset of size $O(\Gamma k^2 \varepsilon^{-d})$, where $\Gamma$ is the number of distinct collections of groups that a point may belong to, and for the $k$-means objective, we show how to construct an $\varepsilon$-coreset of size $O(\Gamma k^3 \varepsilon^{-d-1})$. The former result is the first known coreset construction for the fair clustering problem with the $k$-median objective, and the latter result removes the dependence on the size of the full dataset as in \cite{schmidt2018fair} and generalizes it to multiple, non-disjoint attributes.
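
To make the fairness condition concrete, the snippet below (an illustration of the constraint only, not the paper's coreset construction) compares each group's proportion inside every cluster with its proportion in the whole data set.

    import numpy as np

    def group_proportions(labels, groups):
        """For each cluster, report each group's share of that cluster's points,
        to be compared against the group's overall share of the data set."""
        overall = {g: float(np.mean(groups == g)) for g in np.unique(groups)}
        per_cluster = {}
        for c in np.unique(labels):
            mask = labels == c
            per_cluster[int(c)] = {g: float(np.mean(groups[mask] == g)) for g in overall}
        return overall, per_cluster

    # Hypothetical example: 6 points, 2 clusters, one binary sensitive attribute.
    labels = np.array([0, 0, 0, 1, 1, 1])
    groups = np.array(["a", "b", "a", "b", "a", "b"])
    print(group_proportions(labels, groups))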


Coresets for Archetypal Analysis

Neural Information Processing Systems

Archetypal analysis represents instances as linear mixtures of prototypes (the archetypes) that lie on the boundary of the convex hull of the data. Archetypes are thus often more interpretable than factors computed by other matrix factorization techniques. However, the interpretability comes with high computational cost due to additional convexity-preserving constraints. In this paper, we propose efficient coresets for archetypal analysis. Theoretical guarantees are derived by showing that the quantization error of k-means upper-bounds the archetypal analysis objective; the resulting provable absolute coreset can be computed in only two passes over the data.
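
The paper's construction is not reproduced here; purely to give a flavor of what a two-pass coreset computation can look like, the sketch below performs generic importance sampling proportional to squared distance from the data mean. This is only an assumed stand-in and carries no guarantee for archetypal analysis.

    import numpy as np

    def two_pass_importance_coreset(X, m, rng=None):
        """Generic two-pass importance-sampling sketch (not the paper's construction):
        pass 1 computes the mean, pass 2 scores each point by its squared distance to the mean,
        then points are sampled with a mix of uniform and distance-proportional probability
        and reweighted by the inverse of their expected sampling count."""
        rng = np.random.default_rng(0) if rng is None else rng
        n = len(X)
        mu = X.mean(axis=0)                      # pass 1: data mean
        d2 = ((X - mu) ** 2).sum(axis=1)         # pass 2: squared distances to the mean
        q = 0.5 / n + 0.5 * d2 / d2.sum()        # mixed sampling distribution (sums to 1)
        idx = rng.choice(n, size=m, p=q)
        return X[idx], 1.0 / (m * q[idx])        # sampled points and their weights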


Sets Clustering

arXiv.org Machine Learning

The input to the \emph{sets-$k$-means} problem is an integer $k\geq 1$ and a set $\mathcal{P}=\{P_1,\cdots,P_n\}$ of sets in $\mathbb{R}^d$. The goal is to compute a set $C$ of $k$ centers (points) in $\mathbb{R}^d$ that minimizes the sum $\sum_{P\in \mathcal{P}} \min_{p\in P, c\in C}\left\| p-c \right\|^2$ of squared distances to these sets. An \emph{$\varepsilon$-core-set} for this problem is a weighted subset of $\mathcal{P}$ that approximates this sum up to $1\pm\varepsilon$ factor, for \emph{every} set $C$ of $k$ centers in $\mathbb{R}^d$. We prove that such a core-set of $O(\log^2{n})$ sets always exists, and can be computed in $O(n\log{n})$ time, for every input $\mathcal{P}$ and every fixed $d,k\geq 1$ and $\varepsilon \in (0,1)$. The result easily generalizes to any metric space, distances to the power of $z>0$, and M-estimators that handle outliers. Applying an inefficient but optimal algorithm on this coreset allows us to obtain the first PTAS ($1+\varepsilon$ approximation) for the sets-$k$-means problem that takes time near linear in $n$. This is the first result even for sets-mean on the plane ($k=1$, $d=2$). Open source code and experimental results for document classification and facility locations are also provided.
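
For concreteness, the objective above can be evaluated directly; the sketch below (my illustration, not the authors' code) computes $\sum_{P\in \mathcal{P}} \min_{p\in P, c\in C}\left\| p-c \right\|^2$ with numpy.

    import numpy as np

    def sets_k_means_cost(sets_of_points, centers):
        """Sum over input sets P of min over (p in P, c in centers) of ||p - c||^2."""
        total = 0.0
        for P in sets_of_points:                       # P: array of shape (|P|, d)
            # pairwise squared distances between the points of P and the centers
            d2 = ((P[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
            total += d2.min()                          # closest (point, center) pair for this set
        return total

    # Tiny example: two sets in the plane, a single center (k = 1).
    P1 = np.array([[0.0, 0.0], [2.0, 0.0]])
    P2 = np.array([[5.0, 5.0]])
    centers = np.array([[1.0, 0.0]])
    print(sets_k_means_cost([P1, P2], centers))        # 1.0 + 41.0 = 42.0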


Uncovering Coresets for Classification With Multi-Objective Evolutionary Algorithms

arXiv.org Machine Learning

A coreset is a subset of the training set, using which a machine learning algorithm obtains performance similar to what it would deliver if trained on the whole original data. Coreset discovery is an active and open line of research, as it allows improving the training speed of the algorithms and may help humans understand the results. Building on previous works, a novel approach is presented: candidate coresets are iteratively optimized, adding and removing samples. As there is an obvious trade-off between limiting the training size and the quality of the results, a multi-objective evolutionary algorithm is used to simultaneously minimize the number of points in the set and the classification error. Experimental results on non-trivial benchmarks show that the proposed approach delivers coresets that allow a classifier to obtain lower error and better generalization on unseen data than state-of-the-art coreset discovery techniques.
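
The evolutionary search itself is not shown; the sketch below only illustrates the two objectives such an algorithm trades off, evaluating a candidate subset by its size and by the error of a classifier trained on it (scikit-learn and the random forest are my illustrative choices, not necessarily the paper's setup).

    import numpy as np
    from sklearn.ensemble import RandomForestClassifier

    def coreset_objectives(mask, X_train, y_train, X_val, y_val):
        """Return (subset size, validation error) for a candidate coreset given as a boolean mask.
        A multi-objective evolutionary algorithm would minimize both values simultaneously."""
        idx = np.flatnonzero(mask)
        if idx.size == 0 or len(np.unique(y_train[idx])) < 2:
            return idx.size, 1.0                     # degenerate candidates get the worst error
        clf = RandomForestClassifier(n_estimators=50, random_state=0)
        clf.fit(X_train[idx], y_train[idx])
        return idx.size, 1.0 - clf.score(X_val, y_val)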


On Coresets for Support Vector Machines

arXiv.org Machine Learning

We present an efficient coreset construction algorithm for large-scale Support Vector Machine (SVM) training in Big Data and streaming applications. A coreset is a small, representative subset of the original data points, such that models trained on the coreset are provably competitive with those trained on the original data set. Since the size of the coreset is generally much smaller than the original set, our preprocess-then-train scheme has the potential to lead to significant speedups when training SVM models. We prove lower and upper bounds on the size of the coreset required to obtain small data summaries for the SVM problem. As a corollary, we show that our algorithm can be used to extend the applicability of any off-the-shelf SVM solver to streaming, distributed, and dynamic data settings. We evaluate the performance of our algorithm on real-world and synthetic data sets. Our experimental results reaffirm the favorable theoretical properties of our algorithm and demonstrate its practical effectiveness in accelerating SVM training.
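
As a rough picture of the preprocess-then-train scheme, one can fit a scikit-learn SVM on a small weighted subset; in this sketch uniform sampling stands in for the paper's importance-sampling construction, so no guarantee is implied.

    import numpy as np
    from sklearn.svm import LinearSVC

    def train_svm_on_subset(X, y, m, rng=None):
        """Preprocess-then-train sketch: sample a subset of size m (uniformly here, unlike the
        paper's construction), attach inverse-probability weights, and fit an SVM on it."""
        rng = np.random.default_rng(0) if rng is None else rng
        n = len(X)
        idx = rng.choice(n, size=m, replace=False)
        weights = np.full(m, n / m)                  # each sampled point represents n/m originals
        clf = LinearSVC()
        clf.fit(X[idx], y[idx], sample_weight=weights)
        return clf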


Scalable Training of Mixture Models via Coresets

Neural Information Processing Systems

How can we train a statistical mixture model on a massive data set? In this paper, we show how to construct coresets for mixtures of Gaussians and natural generalizations. A coreset is a weighted subset of the data, which guarantees that models fitting the coreset will also provide a good fit for the original data set. We show that, perhaps surprisingly, Gaussian mixtures admit coresets of size independent of the size of the data set. More precisely, we prove that a weighted set of $O(dk^3/\epsilon^2)$ data points suffices for computing a $(1+\epsilon)$-approximation for the optimal model on the original $n$ data points.
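
To illustrate what the guarantee is about, the snippet below (my illustration, not the paper's construction or proof) evaluates the weighted negative log-likelihood of a Gaussian mixture on a weighted subset; a coreset is a weighted subset for which this quantity stays close to the cost on the full data with unit weights.

    import numpy as np
    from scipy.stats import multivariate_normal

    def gmm_neg_log_lik(X, weights, mix, means, covs):
        """Weighted negative log-likelihood of a Gaussian mixture (mix, means, covs) on (X, weights).
        With unit weights this is the usual cost on the full data; a coreset's weighted cost
        is meant to approximate it up to a (1 +/- eps) factor."""
        dens = sum(p * multivariate_normal.pdf(X, mean=m, cov=c)
                   for p, m, c in zip(mix, means, covs))
        return float(-(weights * np.log(dens)).sum())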


On Coresets for Logistic Regression

Neural Information Processing Systems

Coresets are one of the central methods to facilitate the analysis of large data. We continue a recent line of research applying the theory of coresets to logistic regression. First, we show the negative result that no strongly sublinear-sized coresets exist for logistic regression. To deal with intractable worst-case instances, we introduce a complexity measure $\mu(X)$, which quantifies the hardness of compressing a data set for logistic regression. For data sets with bounded $\mu(X)$-complexity, we show that a novel sensitivity sampling scheme produces the first provably sublinear $(1\pm\epsilon)$-coreset.
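
The skeleton of sensitivity sampling is simple to state; the sketch below assumes the per-point sensitivity scores are given (how to bound them, and the role of $\mu(X)$, is the subject of the paper and is not reproduced here), and scikit-learn is used only as an illustrative weighted solver.

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    def sensitivity_sample_and_fit(X, y, sensitivities, m, rng=None):
        """Sample m points with probability proportional to the given sensitivity scores,
        weight each by the inverse of its expected count, and fit weighted logistic regression."""
        rng = np.random.default_rng(0) if rng is None else rng
        p = np.asarray(sensitivities, dtype=float)
        p = p / p.sum()
        idx = rng.choice(len(X), size=m, p=p)
        weights = 1.0 / (m * p[idx])                # inverse-probability weights
        model = LogisticRegression(max_iter=1000)
        model.fit(X[idx], y[idx], sample_weight=weights)
        return model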


Distributed k-means and k-median Clustering on General Topologies

Neural Information Processing Systems

This paper provides new algorithms for distributed clustering for two popular center-based objectives, $k$-median and $k$-means. These algorithms have provable guarantees and improve communication complexity over existing approaches. Following a classic approach in clustering by \cite{har2004coresets}, we reduce the problem of finding a clustering with low cost to the problem of finding a "coreset" of small size. We provide a distributed method for constructing a global coreset which improves over the previous methods by reducing the communication complexity, and which works over general communication topologies. We provide experimental evidence for this approach on both synthetic and real data sets.
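
As a simplified picture of the distributed setting (not the paper's communication-efficient construction), each machine can summarize its local data by a small weighted set and a coordinator can cluster the union of these summaries; here the local summary is just weighted k-means centers, chosen only to keep the sketch short.

    import numpy as np
    from sklearn.cluster import KMeans

    def local_summary(X_local, size):
        """One machine's summary: 'size' k-means centers, each weighted by its cluster population.
        A stand-in for a real local coreset, used only for illustration."""
        km = KMeans(n_clusters=size, n_init=10, random_state=0).fit(X_local)
        counts = np.bincount(km.labels_, minlength=size).astype(float)
        return km.cluster_centers_, counts

    def distributed_k_means(partitions, k, summary_size=20):
        """Coordinator: union the weighted local summaries and run weighted k-means on them."""
        centers, weights = zip(*(local_summary(P, summary_size) for P in partitions))
        S, w = np.vstack(centers), np.concatenate(weights)
        return KMeans(n_clusters=k, n_init=10, random_state=0).fit(S, sample_weight=w)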