Ayme, Alexis
Random features models: a way to study the success of naive imputation
Ayme, Alexis, Boyer, Claire, Dieuleveut, Aymeric, Scornet, Erwan
Constant (naive) imputation is still widely used in practice, as it is an easy-to-use first technique to deal with missing data. Yet, this simple method could be expected to induce a large bias for prediction purposes, as the imputed input may strongly differ from the true underlying data. However, recent works suggest that this bias is low in the context of high-dimensional linear predictors when data is missing completely at random (MCAR). This paper completes the picture for linear predictors by confirming the intuition that the bias is negligible and that, surprisingly, naive imputation also remains relevant in very low dimension. To this aim, we consider a unique underlying random features model, which offers a rigorous framework for studying predictive performances while the dimension of the observed features varies. Building on these theoretical results, we establish finite-sample bounds on stochastic gradient descent (SGD) predictors applied to zero-imputed data, a strategy particularly well suited for large-scale learning. While the MCAR assumption may appear strong, we show that similar favorable behaviors occur for more complex missing-data scenarios.
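To make the zero-imputation-then-SGD strategy concrete, here is a minimal sketch in Python. It is illustrative only (the synthetic data, the Bernoulli observation probability p, and the constant step size are assumptions, not the paper's setup): it draws an MCAR mask, replaces missing entries by zero, and runs averaged SGD for least squares on the imputed inputs.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic linear data (illustrative, not the paper's experimental setup).
n, d = 5000, 20
X = rng.standard_normal((n, d))
beta = rng.standard_normal(d)
y = X @ beta + 0.1 * rng.standard_normal(n)

# MCAR mask: each entry is observed independently with probability p.
p = 0.7
M = rng.random((n, d)) < p

# Naive (zero) imputation: missing entries are replaced by 0.
X_imp = np.where(M, X, 0.0)

# One pass of averaged SGD on the zero-imputed data (squared loss,
# constant step size as a simple choice).
step = 0.01
theta = np.zeros(d)
theta_avg = np.zeros(d)
for t in range(n):
    x_t, y_t = X_imp[t], y[t]
    grad = (x_t @ theta - y_t) * x_t            # gradient of (x^T theta - y)^2 / 2
    theta -= step * grad
    theta_avg += (theta - theta_avg) / (t + 1)  # Polyak-Ruppert averaging

# Prediction risk on fresh zero-imputed inputs, measured against the
# noiseless signal, so it includes the bias induced by imputation.
X_test = rng.standard_normal((1000, d))
M_test = rng.random((1000, d)) < p
X_test_imp = np.where(M_test, X_test, 0.0)
risk = np.mean((X_test_imp @ theta_avg - X_test @ beta) ** 2)
print(f"test risk of averaged SGD on zero-imputed data: {risk:.3f}")
```

This single-pass scheme only touches each (imputed) sample once, which is what makes the approach attractive for large-scale learning.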
Minimax rate of consistency for linear models with missing values
Ayme, Alexis, Boyer, Claire, Dieuleveut, Aymeric, Scornet, Erwan
Missing values become increasingly common as datasets grow in size. They can occur for a variety of reasons, such as sensor failures, refusals to answer poll questions, or aggregations of data coming from different sources (with different methods of data collection). Several missing-value generation processes may coexist in the same dataset, which makes the task of data cleaning difficult or impossible without introducing large biases. In his seminal work, Rubin [1976] distinguishes three missing-value scenarios: Missing Completely At Random (MCAR), Missing At Random (MAR), and Missing Not At Random (MNAR), depending on the links between the observed variables, the missing ones, and the missingness pattern. In the linear regression framework, most of the literature focuses on parameter estimation [Little, 1992, Jones, 1996], sometimes using a sparsity prior leading to the Lasso estimator [Loh and Wainwright, 2012] or the Dantzig selector [Rosenbaum and Tsybakov, 2010]. Note that the robust estimation literature [Dalalyan and Thompson, 2019, Chen and Caramanis, 2013] could also be used to handle missing values, as the latter can be reinterpreted as multiplicative noise in linear models.
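To illustrate Rubin's taxonomy and the multiplicative-noise reinterpretation, here is a hedged sketch (the mask probabilities and the sigmoid link are illustrative assumptions, not taken from the paper). MCAR missingness is independent of the data, MAR missingness depends only on always-observed variables, and MNAR missingness depends on the possibly unobserved value itself.

```python
import numpy as np

rng = np.random.default_rng(1)
n, d = 1000, 5
X = rng.standard_normal((n, d))

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# MCAR: each entry observed with a fixed probability, independent of X.
mask_mcar = rng.random((n, d)) < 0.8              # True = observed

# MAR: missingness of columns 1..d-1 depends only on the
# always-observed column 0.
obs_prob = sigmoid(X[:, [0]])                     # shape (n, 1), broadcasts
mask_mar = np.ones((n, d), dtype=bool)
mask_mar[:, 1:] = rng.random((n, d - 1)) < obs_prob

# MNAR: missingness of an entry depends on its own value
# (here, larger values are more likely to be observed).
mask_mnar = rng.random((n, d)) < sigmoid(X)

# With zero imputation, the mask acts as multiplicative Bernoulli noise
# on the design: X_imp = mask * X, i.e., a linear model with
# multiplicatively perturbed covariates.
X_imp = mask_mcar * X
```

The last line makes the connection to robust estimation explicit: zero-imputed covariates are exactly the true covariates multiplied entrywise by a Bernoulli mask, which is the multiplicative-noise viewpoint mentioned above.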