
Contamination model


Optimal Shrinkage of Singular Values Under Random Data Contamination

Neural Information Processing Systems

A low rank matrix X has been contaminated by uniformly distributed noise, missing values, outliers and corrupt entries. Reconstruction of X from the singular values and singular vectors of the contaminated matrix Y is a key problem in machine learning, computer vision and data science. In this paper we show that common contamination models (including arbitrary combinations of uniform noise, missing values, outliers and corrupt entries) can be described efficiently using a single framework. We develop an asymptotically optimal algorithm that estimates X by manipulation of the singular values of Y, which applies to any of the contamination models considered. Finally, we find an explicit signal-to-noise cutoff, below which estimation of X from the singular value decomposition of Y must fail, in a well-defined sense.
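The general recipe can be sketched as follows (a minimal illustration, not the paper's derived optimal shrinker): estimate X by applying a scalar shrinkage function to the singular values of Y while keeping its singular vectors. The hard-threshold cutoff `tau` below is a placeholder for the model-dependent threshold the paper derives.

```python
import numpy as np

def shrink_singular_values(Y, shrinker):
    """Estimate X from contaminated Y by applying a scalar shrinkage
    function to each singular value of Y, keeping the singular vectors."""
    U, s, Vt = np.linalg.svd(Y, full_matrices=False)
    return U @ np.diag(shrinker(s)) @ Vt

def hard_threshold(s, tau):
    # Keep singular values above the cutoff tau, zero out the rest.
    return np.where(s > tau, s, 0.0)

rng = np.random.default_rng(0)
X = rng.standard_normal((50, 2)) @ rng.standard_normal((2, 40))  # rank-2 signal
Y = X + 0.1 * rng.standard_normal(X.shape)                       # additive contamination
X_hat = shrink_singular_values(Y, lambda s: hard_threshold(s, tau=2.0))
```

With a rank-2 signal whose singular values sit well above the noise bulk, thresholding recovers a rank-2 estimate with smaller error than Y itself; the paper replaces the hard threshold with the asymptotically optimal shrinker for the given contamination model.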


Neural Information Processing Systems

Robust learning seeks efficient algorithms that can recover an underlying model despite possibly malicious corruptions in the data. In recent decades, the ability to handle corrupted measurements has become crucially important.


On Learning Ising Models under Huber's Contamination Model

Neural Information Processing Systems

We study the problem of learning Ising models in a setting where some of the samples from the underlying distribution can be arbitrarily corrupted. In such a setup, we aim to design statistically optimal estimators in a high-dimensional scaling in which the number of nodes p, the number of edges k and the maximal node degree d are allowed to increase to infinity as a function of the sample size n. Our analysis is based on exploiting moments of the underlying distribution, coupled with novel reductions to univariate estimation. Our proposed estimators achieve an optimal dimension-independent dependence on the fraction of corrupted data in the contaminated setting, while also simultaneously achieving high-probability error guarantees with optimal sample complexity. We corroborate our theoretical results by simulations.
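The paper's estimators are more involved, but the core advantage of univariate reductions with robust statistics can be sketched in a toy setting (all numbers below are illustrative, not from the paper): under an eps-fraction of arbitrary corruptions, a median-based univariate estimate degrades by O(eps), while the empirical mean can be dragged arbitrarily far.

```python
import numpy as np

rng = np.random.default_rng(1)
eps = 0.1
clean = rng.normal(0.0, 1.0, size=900)   # samples from the true model
corrupt = np.full(100, 50.0)             # eps-fraction of arbitrary corruptions
data = np.concatenate([clean, corrupt])

# The mean shifts by roughly eps * 50; the median shifts by O(eps),
# no matter how extreme the corruptions are.
naive = data.mean()
robust = np.median(data)
```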


Outlier Robust Mean Estimation with Subgaussian Rates via Stability

Neural Information Processing Systems

We consider a standard stability condition from the recent robust statistics literature and prove that, except with exponentially small failure probability, there exists a large fraction of the inliers satisfying this condition.
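As a rough illustration of what such a condition asserts (the exact form and constants below are simplified, not the paper's definition), a point set is "stable" when its empirical mean and covariance are simultaneously well-behaved:

```python
import numpy as np

def is_stable(S, mu, eps, delta):
    """Simplified stability check: the mean of S is delta-close to mu and
    the empirical covariance has operator norm at most delta**2 / eps.
    (Illustrative constants; the literature's definition additionally
    controls all large subsets of S.)"""
    S = np.asarray(S)
    mean_ok = np.linalg.norm(S.mean(axis=0) - mu) <= delta
    top_eig = np.linalg.eigvalsh(np.cov(S, rowvar=False)).max()
    return mean_ok and top_eig <= delta**2 / eps

rng = np.random.default_rng(4)
S = rng.standard_normal((2000, 5))               # i.i.d. standard Gaussian inliers
S_bad = np.vstack([S, np.full((200, 5), 10.0)])  # same set plus planted outliers
```

A clean Gaussian sample passes the check, while adding a cluster of distant outliers shifts the mean enough to fail it, which is the behavior filtering-based robust estimators exploit.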


Finite sample properties of parametric MMD estimation: robustness to misspecification and dependence

Chérief-Abdellatif, Badr-Eddine, Alquier, Pierre

arXiv.org Artificial Intelligence

One of the main challenges in statistics is the design of a universal estimation procedure. Given data, a universal procedure is an algorithm that provides an estimator of the generating distribution which is simultaneously statistically consistent when the true distribution belongs to the model, and robust otherwise. Typically, a universal estimator is consistent for any model, with minimax-optimal or fast rates of convergence, and is robust to small departures from the model assumptions [Bickel, 1976] such as sparse instead of dense effects or non-Gaussian errors in high-dimensional linear regression. Unfortunately, most statistical procedures are based upon strong assumptions on the model or on the corresponding parameter set, and widely used estimation methods such as maximum likelihood estimation (MLE), the method of moments or Bayesian posterior inference may fail even on simple problems when such assumptions do not hold. For instance, even though MLE is consistent and asymptotically normal with optimal rates of convergence in parametric estimation under suitable regularity assumptions [Le Cam, 1970, Van der Vaart, 1990] and in nonparametric estimation under entropy conditions, this method behaves poorly in case of misspecification, when the true generating distribution of the data does not belong to the chosen model. Let us investigate a simple example presented in [Birgé, 2006] that illustrates the non-universality of MLE. We observe a collection of n independent and identically distributed (i.i.d.) random variables X
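A minimal sketch of minimum-MMD estimation (a location model with a Gaussian kernel, and a grid search standing in for the gradient-based optimizers used in practice) shows the robustness the authors analyze: gross outliers barely move the estimate, because the bounded kernel caps each observation's influence.

```python
import numpy as np

def mmd2(x, y, bw=1.0):
    """Biased empirical squared MMD between samples x and y, Gaussian kernel."""
    def k(a, b):
        return np.exp(-(a[:, None] - b[None, :]) ** 2 / (2 * bw**2))
    return k(x, x).mean() + k(y, y).mean() - 2.0 * k(x, y).mean()

rng = np.random.default_rng(2)
data = np.concatenate([rng.normal(0.0, 1.0, 475),  # inliers from N(theta*, 1), theta* = 0
                       np.full(25, 100.0)])        # 5% gross outliers

base = rng.normal(0.0, 1.0, 500)  # common random numbers for the model samples
thetas = np.linspace(-2.0, 2.0, 81)
theta_hat = thetas[int(np.argmin([mmd2(base + t, data) for t in thetas]))]
```

Despite 5% of the sample sitting at 100, the minimum-MMD estimate stays near the true location, whereas the sample mean would be pulled to about 5.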


Noise-Adaptive Conformal Classification with Marginal Coverage

Bortolotti, Teresa, Wang, Y. X. Rachel, Tong, Xin, Menafoglio, Alessandra, Vantini, Simone, Sesia, Matteo

arXiv.org Machine Learning

Conformal inference seeks rigorous uncertainty quantification for the predictions of any black-box machine learning model, without requiring parametric assumptions (Vovk et al., 2005). In classification, these methods aim to construct a prediction set for the label of a new test point while guaranteeing a specified coverage level. The split-conformal approach achieves this by leveraging residuals (or non-conformity scores) from a pre-trained model applied to an independent calibration data set, assuming exchangeability with the test data. Perfect exchangeability, however, may not always hold in practice, due for example to possible distribution shifts between the available data and the future test points of interest, creating a need to relax the assumptions underlying conformal inference (Barber et al., 2023).
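The split-conformal baseline described above can be sketched compactly (the "model" here is a simulated, perfectly calibrated stand-in): non-conformity scores on a calibration set determine a threshold, and the prediction set collects every label the model does not rule out at that threshold.

```python
import numpy as np

def split_conformal_sets(cal_scores, test_probs, alpha=0.1):
    """Split-conformal prediction sets for classification, using the
    score s(x, y) = 1 - p_model(y | x)."""
    n = len(cal_scores)
    # Finite-sample-corrected quantile: guarantees >= 1 - alpha marginal
    # coverage under exchangeability of calibration and test data.
    level = np.ceil((n + 1) * (1 - alpha)) / n
    q = np.quantile(cal_scores, level, method="higher")
    return [np.flatnonzero(1 - p <= q) for p in test_probs]

# Toy calibrated "model": labels are drawn from the predicted probabilities.
rng = np.random.default_rng(3)
probs = rng.dirichlet(np.ones(3), size=3000)
labels = np.array([rng.choice(3, p=p) for p in probs])
cal_scores = 1 - probs[:2000][np.arange(2000), labels[:2000]]
sets = split_conformal_sets(cal_scores, probs[2000:], alpha=0.1)
coverage = np.mean([y in s for s, y in zip(sets, labels[2000:])])
```

Under exchangeability, empirical coverage lands near the nominal 90%; the noise-adaptive methods discussed in the paper modify this construction when exchangeability is only approximate.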


Heavy-tailed Contamination is Easier than Adversarial Contamination

Cherapanamjeri, Yeshwanth, Lee, Daniel

arXiv.org Machine Learning

A large body of work in the statistics and computer science communities dating back to Huber (Huber, 1960) has led to statistically and computationally efficient outlier-robust estimators. Two particular outlier models have received significant attention: the adversarial and heavy-tailed models. While the former models outliers as the result of a malicious adversary manipulating the data, the latter relaxes distributional assumptions on the data, allowing outliers to occur naturally as part of the data generating process. In the first setting, the goal is to develop estimators robust to the largest fraction of outliers, while in the second, one seeks estimators to combat the loss of statistical efficiency, where the dependence on the failure probability is paramount. Despite these distinct motivations, the algorithmic approaches to both these settings have converged, prompting questions on the relationship between the models. In this paper, we investigate and provide a principled explanation for this phenomenon. First, we prove that any adversarially robust estimator is also resilient to heavy-tailed outliers for any statistical estimation problem with i.i.d. data. As a corollary, optimal adversarially robust estimators for mean estimation, linear regression, and covariance estimation are also optimal heavy-tailed estimators. Conversely, for arguably the simplest high-dimensional estimation task of mean estimation, we construct heavy-tailed estimators whose application to the adversarial setting requires any black-box reduction to remove almost all the outliers in the data. Taken together, our results imply that heavy-tailed estimation is likely easier than adversarially robust estimation, opening the door to novel algorithmic approaches for the heavy-tailed setting. Additionally, confidence intervals obtained for adversarially robust estimation also hold with high probability.
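The canonical heavy-tailed estimator in this line of work is median-of-means, sketched below: choosing a number of blocks on the order of log(1/delta) yields subgaussian-style confidence intervals for heavy-tailed data, yet an adversary who concentrates corruptions can bias many block means at once, which gives some intuition for the separation between the two models.

```python
import numpy as np

def median_of_means(x, n_blocks):
    """Split the sample into n_blocks groups, average each group,
    and return the median of the block averages."""
    blocks = np.array_split(np.asarray(x), n_blocks)
    return np.median([b.mean() for b in blocks])

rng = np.random.default_rng(5)
x = rng.standard_t(df=2.5, size=10_000)  # heavy-tailed sample, mean 0
mom = median_of_means(x, n_blocks=20)    # ~log(1/delta) blocks for confidence delta
```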