K-bMOM: a robust Lloyd-type clustering algorithm based on bootstrap Median-of-Means

Brunet-Saumard, Camille, Genetay, Edouard, Saumard, Adrien

arXiv.org Machine Learning 

Data scientists have nowadays to deal with massive and complex datasets, that are often corrupted by outliers. Classical data mining procedures such as K-means or more general EM algorithms for instance are however sensitive to the presence of outliers, which can induce a time consuming pre-processing of the data. In this context, robust versions of data mining procedures are particularly relevant and we investigate a way to produce a Lloyd-type algorithm for hard clustering that is robust to the presence of ouliers. To do this, we propose to use a variant of median-of-means (MOM) statistics, that we call bootstrap median-of-means (bMOM). MOM principle has been the object of recent active research in mean estimation, regression, highdimensional framework and also supervised classification and machine learning ([17, 9, 15, 16, 19, 18, 20, 22]). Note that other approaches to robustness for K-means exist in the literature, such as for instance K-median or trimmed K-means (see for instance the survey [10] and references therein; see also [5]). Given a dataset, the boostrap median-of-means consists in first generating a (large) bootstrap sample and then perform a classical median-of-means on this bootstrap sample. We prove in Section 2 that if enough blocks are generated from the bootstrap sampling, then for a fixed block size, bMOM has a higher breakdown point than MOM.

Duplicate Docs Excel Report

Title
None found

Similar Docs  Excel Report  more

TitleSimilaritySource
None found