A Unified Approach to Robust Mean Estimation
Adarsh Prasad, Sivaraman Balakrishnan, Pradeep Ravikumar
Modern datasets that arise in various branches of science and engineering are characterized by their ever-increasing scale and richness. This is spurred in part by easier access to computing, internet, and various sensor-based technologies that enable the collection of such varied datasets. On the flip side, these large and rich datasets are no longer carefully curated, are often collected in a decentralized, distributed fashion, and are consequently plagued with the complexities of heterogeneity, adversarial manipulations, and outliers. The analysis of these huge datasets is thus fraught with methodological challenges. Understanding the fundamental challenges and tradeoffs in handling such "dirty data" is precisely the premise of the field of robust statistics. Here, the aforementioned complexities are largely formalized under two different models of robustness: (1) the heavy-tailed model, where the sampling distribution can have heavy tails and, for instance, only low-order moments of the distribution are assumed to be finite; and (2) the ε-contamination model, where the sampling distribution is modeled as a well-behaved distribution contaminated by an ε fraction of arbitrary outliers. In each case, classical estimators of the distribution (based, for instance, on the maximum likelihood estimator) can behave considerably worse (potentially arbitrarily worse) than under standard settings where the data is better behaved and satisfies various regularity properties. In particular, these classical estimators can be extremely sensitive to the tails of the distribution or to the outliers, so the broad goal in robust statistics is to construct estimators that improve on these classical estimators by reducing their sensitivity to outliers.
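The sensitivity described above is easy to see numerically. The following minimal sketch (illustrative only, not the estimator developed in the paper) contrasts the empirical mean with two classical remedies: median-of-means under heavy tails, and the plain median under contamination. The distribution choices, the contamination fraction ε = 0.05, and the outlier location 100 are hypothetical values picked for demonstration.

```python
import numpy as np

rng = np.random.default_rng(0)
n, trials = 2000, 500

def median_of_means(x, num_blocks=20):
    # Split the sample into blocks, average within each block, and
    # return the median of the block means: a classical heavy-tail remedy.
    return np.median([block.mean() for block in np.array_split(x, num_blocks)])

# (1) Heavy-tailed model: Student's t with 2 degrees of freedom has mean 0
# but infinite variance, so the empirical mean suffers occasional large
# deviations that median-of-means suppresses.
worst_mean, worst_mom = 0.0, 0.0
for _ in range(trials):
    x = rng.standard_t(df=2, size=n)
    worst_mean = max(worst_mean, abs(x.mean()))
    worst_mom = max(worst_mom, abs(median_of_means(x)))
print(f"heavy tails, worst deviation from the true mean 0 over {trials} trials:")
print(f"  empirical mean: {worst_mean:.3f}   median-of-means: {worst_mom:.3f}")

# (2) eps-contamination model: a (1 - eps) fraction of N(0, 1) samples plus
# an eps fraction of outliers at an arbitrary point (here 100). The empirical
# mean shifts by roughly eps * 100, while the median barely moves.
eps = 0.05
x = rng.normal(size=n)
x[rng.random(n) < eps] = 100.0
print(f"contamination: empirical mean {x.mean():.2f}   median {np.median(x):.2f}")
```

Note that the classical fix differs between the two models, which is part of what makes a single estimator that handles both settings, as the title proposes, desirable.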
arXiv.org Artificial Intelligence
Jul-1-2019