AITopics | Saulpic, David

Collaborating Authors

Saulpic, David

Information about AI from the News, Publications, and Conferences

Automatic Classification – Tagging and Summarization – Customizable Filtering and Analysis

If you are looking for an answer to the question What is Artificial Intelligence? and you only have a minute, then here's the definition the Association for the Advancement of Artificial Intelligence offers on its home page: "the scientific understanding of the mechanisms underlying thought and intelligent behavior and their embodiment in machines."

However, if you are fortunate enough to have more than a minute, then please get ready to embark upon an exciting journey exploring AI (but beware, it could last a lifetime) …

A Tight VC-Dimension Analysis of Clustering Coresets with Applications

Cohen-Addad, Vincent, Draganov, Andrew, Russo, Matteo, Saulpic, David, Schwiegelshohn, Chris

arXiv.org Artificial IntelligenceJan-11-2025

We consider coresets for $k$-clustering problems, where the goal is to assign points to centers minimizing powers of distances. A popular example is the $k$-median objective $\sum_{p}\min_{c\in C}dist(p,C)$. Given a point set $P$, a coreset $\Omega$ is a small weighted subset that approximates the cost of $P$ for all candidate solutions $C$ up to a $(1\pm\varepsilon )$ multiplicative factor. In this paper, we give a sharp VC-dimension based analysis for coreset construction. As a consequence, we obtain improved $k$-median coreset bounds for the following metrics: Coresets of size $\tilde{O}\left(k\varepsilon^{-2}\right)$ for shortest path metrics in planar graphs, improving over the bounds $\tilde{O}\left(k\varepsilon^{-6}\right)$ by [Cohen-Addad, Saulpic, Schwiegelshohn, STOC'21] and $\tilde{O}\left(k^2\varepsilon^{-4}\right)$ by [Braverman, Jiang, Krauthgamer, Wu, SODA'21]. Coresets of size $\tilde{O}\left(kd\ell\varepsilon^{-2}\log m\right)$ for clustering $d$-dimensional polygonal curves of length at most $m$ with curves of length at most $\ell$ with respect to Frechet metrics, improving over the bounds $\tilde{O}\left(k^3d\ell\varepsilon^{-3}\log m\right)$ by [Braverman, Cohen-Addad, Jiang, Krauthgamer, Schwiegelshohn, Toftrup, and Wu, FOCS'22] and $\tilde{O}\left(k^2d\ell\varepsilon^{-2}\log m \log |P|\right)$ by [Conradi, Kolbe, Psarros, Rohde, SoCG'24].

artificial intelligence, machine learning, tight vc-dimension analysis, (2 more...)

arXiv.org Artificial Intelligence

2501.06588

Genre: Research Report (0.40)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Computational Learning Theory (0.60)

Add feedback

Almost-linear Time Approximation Algorithm to Euclidean $k$-median and $k$-means

la Tour, Max Dupré, Saulpic, David

arXiv.org Artificial IntelligenceJul-15-2024

The k-means objective function was introduced by Lloyd in 1957 (and published later in [Llo82]) as a measure of the quality of compression. Given a set of points P and an integer k, minimizing the k-means objective yields a set of k centers that provide a good compressed representation of the original dataset P. Lloyd's original motivation was to compress analog audio signals into numerical ones: numerical signals have to be discrete, and Lloyd proposed a method to find which frequencies should be kept in the discretization. His method was a heuristic trying to minimize what he called the quantization error, which is the sum, for each point, of the squared distance to its representative. This is precisely the k-means cost, and the goal of the k-means problem is to find the set of k representatives (or centers) that minimizes this cost. In contrast, the k-median cost function is the sum, for each point, of the distance to its closest center, inherently giving less weight to the outliers in the dataset.

algorithm, artificial intelligence, machine learning, (18 more...)

arXiv.org Artificial Intelligence

2407.11217

Country: North America > United States > Louisiana (0.14)

Genre:

Research Report (0.50)
Workflow (0.46)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (1.00)

Add feedback

Making Old Things New: A Unified Algorithm for Differentially Private Clustering

la Tour, Max Dupré, Henzinger, Monika, Saulpic, David

arXiv.org Artificial IntelligenceJun-17-2024

As a staple of data analysis and unsupervised learning, the problem of private clustering has been widely studied under various privacy models. Centralized differential privacy is the first of them, and the problem has also been studied for the local and the shuffle variation. In each case, the goal is to design an algorithm that computes privately a clustering, with the smallest possible error. The study of each variation gave rise to new algorithms: the landscape of private clustering algorithms is therefore quite intricate. In this paper, we show that a 20-year-old algorithm can be slightly modified to work for any of these models. This provides a unified picture: while matching almost all previously known results, it allows us to improve some of them and extend it to a new privacy model, the continual observation setting, where the input is changing over time and the algorithm must output a new solution at each time step.

artificial intelligence, data mining, machine learning, (20 more...)

arXiv.org Artificial Intelligence

2406.11649

Country:

Europe (0.92)
North America > United States > California (0.27)
North America > Canada > Quebec > Montreal (0.14)

Genre: Research Report (1.00)

Industry: Information Technology > Security & Privacy (1.00)

Technology:

Information Technology > Security & Privacy (1.00)
Information Technology > Data Science > Data Mining (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.66)

Add feedback

Settling Time vs. Accuracy Tradeoffs for Clustering Big Data

Draganov, Andrew, Saulpic, David, Schwiegelshohn, Chris

arXiv.org Artificial IntelligenceApr-2-2024

We study the theoretical and practical runtime limits of k-means and k-median clustering on large datasets. Since effectively all clustering methods are slower than the time it takes to read the dataset, the fastest approach is to quickly compress the data and perform the clustering on the compressed representation. Unfortunately, there is no universal best choice for compressing the number of points - while random sampling runs in sublinear time and coresets provide theoretical guarantees, the former does not enforce accuracy while the latter is too slow as the numbers of points and clusters grow. Indeed, it has been conjectured that any sensitivity-based coreset construction requires super-linear time in the dataset size. We examine this relationship by first showing that there does exist an algorithm that obtains coresets via sensitivity sampling in effectively linear time - within log-factors of the time it takes to read the data. Any approach that significantly improves on this must then resort to practical heuristics, leading us to consider the spectrum of sampling strategies across both real and artificial datasets in the static and streaming settings. Through this, we show the conditions in which coresets are necessary for preserving cluster validity as well as the settings in which faster, cruder sampling strategies are sufficient. As a result, we provide a comprehensive theoretical and practical blueprint for effective clustering regardless of data size. Our code is publicly available and has scripts to recreate the experiments.

artificial intelligence, data mining, machine learning, (19 more...)

arXiv.org Artificial Intelligence

2404.01936

Country:

Europe (1.00)
North America > United States > California (0.46)
North America > United States > Arizona > Maricopa County > Phoenix (0.14)

Genre: Research Report (0.50)

Industry: Transportation (0.46)

Technology:

Information Technology > Data Science > Data Mining (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (1.00)

Add feedback

Data-Efficient Learning via Clustering-Based Sensitivity Sampling: Foundation Models and Beyond

Axiotis, Kyriakos, Cohen-Addad, Vincent, Henzinger, Monika, Jerome, Sammy, Mirrokni, Vahab, Saulpic, David, Woodruff, David, Wunder, Michael

arXiv.org Artificial IntelligenceFeb-27-2024

We study the data selection problem, whose aim is to select a small representative subset of data that can be used to efficiently train a machine learning model. We present a new data selection approach based on $k$-means clustering and sensitivity sampling. Assuming access to an embedding representation of the data with respect to which the model loss is H\"older continuous, our approach provably allows selecting a set of ``typical'' $k + 1/\varepsilon^2$ elements whose average loss corresponds to the average loss of the whole dataset, up to a multiplicative $(1\pm\varepsilon)$ factor and an additive $\varepsilon \lambda \Phi_k$, where $\Phi_k$ represents the $k$-means cost for the input embeddings and $\lambda$ is the H\"older constant. We furthermore demonstrate the performance and scalability of our approach on fine-tuning foundation models and show that it outperforms state-of-the-art methods. We also show how it can be applied on linear regression, leading to a new sampling strategy that surprisingly matches the performances of leverage score sampling, while being conceptually simpler and more scalable.

artificial intelligence, clustering-based sensitivity sampling, machine learning, (3 more...)

arXiv.org Artificial Intelligence

2402.17327

Genre: Research Report (0.69)

Technology: Information Technology > Artificial Intelligence > Machine Learning (0.87)

Add feedback

Differential Privacy for Clustering Under Continual Observation

la Tour, Max Dupré, Henzinger, Monika, Saulpic, David

arXiv.org Artificial IntelligenceJul-27-2023

We consider the problem of clustering privately a dataset in $\mathbb{R}^d$ that undergoes both insertion and deletion of points. Specifically, we give an $\varepsilon$-differentially private clustering mechanism for the $k$-means objective under continual observation. This is the first approximation algorithm for that problem with an additive error that depends only logarithmically in the number $T$ of updates. The multiplicative error is almost the same as non privately. To do so we show how to perform dimension reduction under continual observation and combine it with a differentially private greedy approximation algorithm for $k$-means. We also partially extend our results to the $k$-median problem.

artificial intelligence, data mining, machine learning, (20 more...)

arXiv.org Artificial Intelligence

2307.0343

Country:

Europe (0.92)
North America > United States > Texas (0.14)
North America > United States > California (0.14)
North America > Canada > Quebec > Montreal (0.14)

Genre: Research Report (1.00)

Industry: Information Technology > Security & Privacy (1.00)

Technology:

Information Technology > Security & Privacy (1.00)
Information Technology > Data Science > Data Mining (0.93)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (0.49)

Add feedback

Improved Coresets for Euclidean $k$-Means

Cohen-Addad, Vincent, Larsen, Kasper Green, Saulpic, David, Schwiegelshohn, Chris, Sheikh-Omar, Omar Ali

arXiv.org Artificial IntelligenceNov-16-2022

Given a set of $n$ points in $d$ dimensions, the Euclidean $k$-means problem (resp. the Euclidean $k$-median problem) consists of finding $k$ centers such that the sum of squared distances (resp. sum of distances) from every point to its closest center is minimized. The arguably most popular way of dealing with this problem in the big data setting is to first compress the data by computing a weighted subset known as a coreset and then run any algorithm on this subset. The guarantee of the coreset is that for any candidate solution, the ratio between coreset cost and the cost of the original instance is less than a $(1\pm \varepsilon)$ factor. The current state of the art coreset size is $\tilde O(\min(k^{2} \cdot \varepsilon^{-2},k\cdot \varepsilon^{-4}))$ for Euclidean $k$-means and $\tilde O(\min(k^{2} \cdot \varepsilon^{-2},k\cdot \varepsilon^{-3}))$ for Euclidean $k$-median. The best known lower bound for both problems is $\Omega(k \varepsilon^{-2})$. In this paper, we improve the upper bounds $\tilde O(\min(k^{3/2} \cdot \varepsilon^{-2},k\cdot \varepsilon^{-4}))$ for $k$-means and $\tilde O(\min(k^{4/3} \cdot \varepsilon^{-2},k\cdot \varepsilon^{-3}))$ for $k$-median. In particular, ours is the first provable bound that breaks through the $k^2$ barrier while retaining an optimal dependency on $\varepsilon$.

artificial intelligence, data mining, machine learning, (15 more...)

arXiv.org Artificial Intelligence

2211.08184

Country:

Europe (1.00)
North America > Canada (0.68)
North America > United States > California (0.67)

Genre: Research Report (0.64)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (1.00)
Information Technology > Data Science > Data Mining (0.88)

Add feedback