K-bMOM: a robust Lloyd-type clustering algorithm based on bootstrap Median-of-Means

Brunet-Saumard, Camille, Genetay, Edouard, Saumard, Adrien

Feb-10-2020–arXiv.org Machine Learning

Data scientists have nowadays to deal with massive and complex datasets, that are often corrupted by outliers. Classical data mining procedures such as K-means or more general EM algorithms for instance are however sensitive to the presence of outliers, which can induce a time consuming pre-processing of the data. In this context, robust versions of data mining procedures are particularly relevant and we investigate a way to produce a Lloyd-type algorithm for hard clustering that is robust to the presence of ouliers. To do this, we propose to use a variant of median-of-means (MOM) statistics, that we call bootstrap median-of-means (bMOM). MOM principle has been the object of recent active research in mean estimation, regression, highdimensional framework and also supervised classification and machine learning ([17, 9, 15, 16, 19, 18, 20, 22]). Note that other approaches to robustness for K-means exist in the literature, such as for instance K-median or trimmed K-means (see for instance the survey [10] and references therein; see also [5]). Given a dataset, the boostrap median-of-means consists in first generating a (large) bootstrap sample and then perform a classical median-of-means on this bootstrap sample. We prove in Section 2 that if enough blocks are generated from the bootstrap sampling, then for a fixed block size, bMOM has a higher breakdown point than MOM.

algorithm, breakdown point, outlier, (15 more...)

arXiv.org Machine Learning

Feb-10-2020

arXiv.org PDF

Add feedback

Country:
- North America > United States
  - New York (0.04)
  - New Jersey > Hudson County
    - Hoboken (0.04)
- Europe > France
  - Brittany > Ille-et-Vilaine > Rennes (0.04)

Genre:
- Research Report (0.50)

Technology:
- Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (1.00)

Duplicate Docs Excel Report

Title
None found

Similar Docs Excel Report more

Title	Similarity	Source
None found