A Unified Approach to Robust Mean Estimation
Adarsh Prasad, Sivaraman Balakrishnan, Pradeep Ravikumar
Modern datasets that arise in various branches of science and engineering are characterized by their ever-increasing scale and richness. This is spurred in part by easier access to computing, internet, and various sensor-based technologies that enable the collection of such varied datasets. On the flip side, these large and rich datasets are no longer carefully curated, are often collected in a decentralized, distributed fashion, and are consequently plagued with the complexities of heterogeneity, adversarial manipulations, and outliers. The analysis of these huge datasets is thus fraught with methodological challenges. Understanding the fundamental challenges and tradeoffs in handling such "dirty data" is precisely the premise of the field of robust statistics. Here, the aforementioned complexities are largely formalized under two different models of robustness: (1) the heavy-tailed model, where the sampling distribution can have heavy tails and, for instance, only low-order moments of the distribution are assumed to be finite; and (2) the ε-contamination model, where the sampling distribution is modeled as a well-behaved distribution contaminated by an ε fraction of arbitrary outliers. In each case, classical estimators of the distribution (based, for instance, on the maximum likelihood estimator) can behave considerably worse (potentially arbitrarily worse) than under standard settings where the data is better behaved and satisfies various regularity properties. In particular, these classical estimators can be extremely sensitive to the tails of the distribution or to the outliers, so the broad goal in robust statistics is to construct estimators that improve on these classical estimators by reducing their sensitivity to outliers.
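The sensitivity described above is easy to see numerically. The following minimal sketch (illustrative only, not the estimator developed in the paper) contrasts the empirical mean with two classical remedies: median-of-means under heavy tails, and the plain median under contamination. The distribution choices, the contamination fraction ε = 0.05, and the outlier location 100 are hypothetical values picked for demonstration.

```python
import numpy as np

rng = np.random.default_rng(0)
n, trials = 2000, 500

def median_of_means(x, num_blocks=20):
    # Split the sample into blocks, average within each block, and
    # return the median of the block means: a classical heavy-tail remedy.
    return np.median([block.mean() for block in np.array_split(x, num_blocks)])

# (1) Heavy-tailed model: Student's t with 2 degrees of freedom has mean 0
# but infinite variance, so the empirical mean suffers occasional large
# deviations that median-of-means suppresses.
worst_mean, worst_mom = 0.0, 0.0
for _ in range(trials):
    x = rng.standard_t(df=2, size=n)
    worst_mean = max(worst_mean, abs(x.mean()))
    worst_mom = max(worst_mom, abs(median_of_means(x)))
print(f"heavy tails, worst deviation from the true mean 0 over {trials} trials:")
print(f"  empirical mean: {worst_mean:.3f}   median-of-means: {worst_mom:.3f}")

# (2) eps-contamination model: a (1 - eps) fraction of N(0, 1) samples plus
# an eps fraction of outliers at an arbitrary point (here 100). The empirical
# mean shifts by roughly eps * 100, while the median barely moves.
eps = 0.05
x = rng.normal(size=n)
x[rng.random(n) < eps] = 100.0
print(f"contamination: empirical mean {x.mean():.2f}   median {np.median(x):.2f}")
```

Note that the classical fix differs between the two models, which is part of what makes a single estimator that handles both settings, as the title proposes, desirable.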
arXiv.org Artificial Intelligence
Jul-1-2019