Subspace clustering in high-dimensions: Phase transitions & Statistical-to-Computational gap

Luca Pesce, Bruno Loureiro, Florent Krzakala, Lenka Zdeborová

arXiv.org Artificial Intelligence 

With the growing size of modern data, clustering techniques play an important role in reducing the dimensionality of the features used in modern Machine Learning pipelines. Indeed, in many tasks of interest ranging from DNA sequence analysis to image classification, the relevant features are known to live in a lower-dimensional space (intrinsic dimension) than their raw acquisition format (extrinsic dimension) [1]. In these cases, identifying these features can help save computational resources while significantly improving learning performance. But given a corrupted embedding of low-dimensional features in a high-dimensional space, is it always statistically possible to retrieve them? And if so, how can reconstruction be achieved efficiently in practice? In this manuscript we address these two fundamental questions in a simple model for subspace clustering: a k-cluster Gaussian mixture model with sparse centroids. In this model, the low-dimensional hidden features are given by the sparse centroids, which are embedded in a higher-dimensional space and corrupted by additive Gaussian noise. We assume that the number of non-zero components of the centroids, as well as the number of samples, scales linearly with the dimension of the embedding space. Given a finite sample from the mixture, the goal of the statistician is to cluster the data, i.e., to estimate the centroids (or features) as well as possible.
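The generative model described above can be sketched in a few lines of NumPy. This is a minimal illustration, not the paper's code: the parameter names (`d`, `k`, `rho`, `n`, `sigma`) and the choice of Gaussian non-zero centroid entries are assumptions made here for concreteness.

```python
import numpy as np

# Hedged sketch of a k-cluster Gaussian mixture with sparse centroids,
# as described in the abstract. All parameter values are illustrative.
rng = np.random.default_rng(0)

d = 200        # extrinsic (embedding) dimension
k = 3          # number of clusters
rho = 0.1      # fraction of non-zero centroid components (sparsity)
n = 400        # number of samples, scaling linearly with d
sigma = 0.5    # additive Gaussian noise level

# Sparse centroids: each has rho*d non-zero entries on a random support.
centroids = np.zeros((k, d))
for c in range(k):
    support = rng.choice(d, size=int(rho * d), replace=False)
    centroids[c, support] = rng.standard_normal(support.size)

# Draw cluster labels uniformly, then corrupt the centroids with
# additive Gaussian noise to obtain the observed samples.
labels = rng.integers(0, k, size=n)
X = centroids[labels] + sigma * rng.standard_normal((n, d))

print(X.shape)  # (400, 200)
```

The statistician observes only `X`; the clustering task is to recover `labels` and the sparse `centroids` from it.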
