K-bMOM: a robust Lloyd-type clustering algorithm based on bootstrap Median-of-Means
Brunet-Saumard, Camille, Genetay, Edouard, Saumard, Adrien
Data scientists have nowadays to deal with massive and complex datasets, that are often corrupted by outliers. Classical data mining procedures such as K-means or more general EM algorithms for instance are however sensitive to the presence of outliers, which can induce a time consuming pre-processing of the data. In this context, robust versions of data mining procedures are particularly relevant and we investigate a way to produce a Lloyd-type algorithm for hard clustering that is robust to the presence of ouliers. To do this, we propose to use a variant of median-of-means (MOM) statistics, that we call bootstrap median-of-means (bMOM). MOM principle has been the object of recent active research in mean estimation, regression, highdimensional framework and also supervised classification and machine learning ([17, 9, 15, 16, 19, 18, 20, 22]). Note that other approaches to robustness for K-means exist in the literature, such as for instance K-median or trimmed K-means (see for instance the survey [10] and references therein; see also [5]). Given a dataset, the boostrap median-of-means consists in first generating a (large) bootstrap sample and then perform a classical median-of-means on this bootstrap sample. We prove in Section 2 that if enough blocks are generated from the bootstrap sampling, then for a fixed block size, bMOM has a higher breakdown point than MOM.
Feb-10-2020
- Country:
- North America > United States
- New York (0.04)
- New Jersey > Hudson County
- Hoboken (0.04)
- Europe > France
- Brittany > Ille-et-Vilaine > Rennes (0.04)
- North America > United States
- Genre:
- Research Report (0.50)
- Technology: