 missing-data mechanism


Missing Data in Signal Processing and Machine Learning: Models, Methods and Modern Approaches

arXiv.org Machine Learning

Missing data arise when parts of the data are not available for a given variable or a given observation. It is a ubiquitous problem in a wide range of scientific disciplines, including sensor networks, geophysical data analysis, radar and image processing, remote sensing, ecological statistics and biomedical studies, to name just a few [1]-[5]. Signal processing is no exception: there, missing data mainly come from sensor malfunction, hidden or impossible measurements, human errors and natural hazards, all of which can hinder a thorough understanding, analysis, and interpretation of the signal. One of the earliest works on missing data was published in 1932 by Wilks, who mentioned the need to extract as much information as possible from fragmentary answers to questionnaires in the social sciences and government statistics. It is therefore not surprising that the first discipline to confront this issue was mathematical statistics. This led Wilks to derive efficient estimators for the parameters of a bivariate normal distribution when the data contain missing values [6]. This work was extended to the multivariate case by Lord in 1955 [7]. Since the early 1970s, the literature on missing data has flourished with the growth of computational capacity, leading to major developments in signal processing and its related fields, such as statistical inference [2], data analysis [8] and machine learning [9]. In particular, the formulation of a missing-data theory framework by Rubin in [10], which describes the relation between missingness and data values through the so-called missing-data mechanisms, has allowed tremendous advancements in statistical analysis.
There is therefore an urgent need for a tutorial paper that summarizes the existing and novel strategies in the SP & ML literature for problems related to missing data, such as parameter estimation, matrix completion, missing data imputation and learning with missing values, and that shows their potential applications. This tutorial aims to provide practitioners with vital tools, in an accessible way, to answer the question: How to deal with missing data? There are many strategies to handle incomplete signals.


Are labels informative in semi-supervised learning? -- Estimating and leveraging the missing-data mechanism

arXiv.org Machine Learning

Semi-supervised learning is a powerful technique for leveraging unlabeled data to improve machine learning models, but it can be affected by the presence of "informative" labels, which occur when some classes are more likely to be labeled than others. In the missing data literature, such labels are called missing not at random. In this paper, we propose a novel approach to address this issue by estimating the missing-data mechanism and using inverse propensity weighting to debias any SSL algorithm, including those using data augmentation. We also propose a likelihood ratio test to assess whether or not labels are indeed informative. Finally, we demonstrate the performance of the proposed methods on different datasets, in particular on two medical datasets for which we design pseudo-realistic missing data scenarios.
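To make the inverse propensity weighting idea concrete, here is a minimal sketch under strong simplifying assumptions. It is not the paper's method: the two classes, the `PROPENSITY` table and the helper names are hypothetical, the labeling probabilities are taken as known (the paper estimates them), and the weights are applied to a simple class-share estimate rather than to an SSL loss.

```python
import random

# Hypothetical setup: two classes, and the probability that an example
# receives a label depends only on its class (informative, MNAR labels).
# Assumption: the propensities are known; the paper estimates them instead.
PROPENSITY = {0: 0.1, 1: 0.5}

def draw_labeled_classes(n=20000, seed=0):
    """Return the classes of the examples that ended up labeled."""
    rng = random.Random(seed)
    labeled = []
    for _ in range(n):
        y = 1 if rng.random() < 0.5 else 0      # true class prior is 0.5
        if rng.random() < PROPENSITY[y]:        # class-dependent labeling
            labeled.append(y)
    return labeled

def naive_class1_share(labeled):
    """Share of class 1 among labeled examples, ignoring the mechanism."""
    return sum(labeled) / len(labeled)

def ipw_class1_share(labeled, propensity):
    """Share of class 1 with each labeled example weighted by 1/propensity."""
    weights = [1.0 / propensity[y] for y in labeled]
    weighted_ones = sum(w for w, y in zip(weights, labeled) if y == 1)
    return weighted_ones / sum(weights)

labeled = draw_labeled_classes()
naive = naive_class1_share(labeled)
ipw = ipw_class1_share(labeled, PROPENSITY)
print(f"naive share of class 1: {naive:.3f}")
print(f"IPW share of class 1:   {ipw:.3f}")
```

In this toy setting the naive estimate over-represents the class that is labeled more often, while the reweighted estimate should land close to the true prior of 0.5. The same reweighting can in principle be applied to per-example loss terms.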


Regression with Missing Data, a Comparison Study of Techniques Based on Random Forests

arXiv.org Machine Learning

Random forests and recursive trees are widely used in applied statistics and computer science. The popularity of recursive trees rests on several factors: their easy interpretability, the fact that they can be used for both regression and classification tasks, the small number of hyper-parameters to be tuned and, finally, their non-parametric nature, which allows them to infer arbitrarily complex relations between the input and the output space. A random forest combines several randomized trees, improving prediction accuracy at the cost of a slight loss in interpretability. This technique is easily parallelizable, which has made it one of the most popular tools for handling high-dimensional data sets. It has been applied successfully to various practical problems, including chemoinformatics, ecology, 3D object recognition, bioinformatics and econometrics. Biau and Scornet (2016) present a detailed list of applications as well as a review of random forests. In the present work we focus on the ability of random forests to deal with missing values.


Consequences of Model Misspecification for Maximum Likelihood Estimation with Missing Data

#artificialintelligence

Researchers are often faced with the challenge of developing statistical models with incomplete data. Exacerbating this situation is the possibility that either the researcher's complete-data model or the model of the missing-data mechanism is misspecified. In this article, we create a formal theoretical framework for developing statistical models and detecting model misspecification in the presence of incomplete data where maximum likelihood estimates are obtained by maximizing the observable-data likelihood function when the missing-data mechanism is assumed ignorable. First, we provide sufficient regularity conditions on the researcher's complete-data model to characterize the asymptotic behavior of maximum likelihood estimates in the simultaneous presence of both missing data and model misspecification. These results are then used to derive robust hypothesis testing methods for possibly misspecified models in the presence of Missing at Random (MAR) or Missing Not at Random (MNAR) missing data.
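The role of the ignorability assumption can be written out in a few lines. The notation below is an assumption introduced here, not taken from the article: $y_{\mathrm{obs}}$ and $y_{\mathrm{mis}}$ are the observed and missing parts of the data, $m$ is the missingness indicator, $\theta$ parameterizes the complete-data model and $\phi$ the missing-data mechanism.

```latex
% Joint likelihood of the observed data and the missingness pattern
L(\theta, \phi) = p(y_{\mathrm{obs}}, m; \theta, \phi)
  = \int p(y_{\mathrm{obs}}, y_{\mathrm{mis}}; \theta)\,
         p(m \mid y_{\mathrm{obs}}, y_{\mathrm{mis}}; \phi)\, dy_{\mathrm{mis}}

% Under MAR the mechanism does not depend on the missing values:
%   p(m \mid y_{\mathrm{obs}}, y_{\mathrm{mis}}; \phi) = p(m \mid y_{\mathrm{obs}}; \phi)
L(\theta, \phi) = p(m \mid y_{\mathrm{obs}}; \phi)
  \int p(y_{\mathrm{obs}}, y_{\mathrm{mis}}; \theta)\, dy_{\mathrm{mis}}
  = p(m \mid y_{\mathrm{obs}}; \phi)\; p(y_{\mathrm{obs}}; \theta)
```

If, in addition, $\theta$ and $\phi$ are distinct parameters, the first factor does not involve $\theta$, so maximizing the observed-data likelihood $p(y_{\mathrm{obs}}; \theta)$ alone gives the same estimate of $\theta$. Under MNAR this factorization is not available, which is one way such a model can be misspecified.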


Imputation and low-rank estimation with Missing Non At Random data

arXiv.org Machine Learning

Preprint dated January 8, 2019.

... the use of the Expectation-Maximization (EM) algorithm [8], which yields maximum likelihood estimators in various incomplete-data problems [21]. The theoretical guarantees of these methods, which ensure the correct prediction of missing values or the correct estimation of some parameters of interest, are only valid if some assumptions are made on how the data came to be missing. Rubin [31] introduced three types of missing-data mechanisms: (i) the restrictive assumption of missing completely at random (MCAR) data, (ii) missing at random (MAR) data, where the missingness may depend only on the observed variables, and (iii) the more general assumption of missing not at random (MNAR) data, i.e. when the unavailability of the data depends on the values of other variables and on its own value. Classic examples of MNAR data, which is the focus of the paper, are surveys in which rich people are less willing to disclose their income, or in which people are less inclined to answer sensitive questions about their addictions. Another example is the diagnosis of Alzheimer's disease, which can be made using a score obtained by the patient on a specific test. However, a patient who has the disease has difficulty answering questions and is more likely to abandon the test before it ends.
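The three mechanisms can be illustrated with a small simulation. This is a sketch under assumptions made up for illustration only: the toy age/income model, the missingness probabilities and all function names are hypothetical and do not come from the paper. Age is always observed, and only income can be missing.

```python
import random

def mean(values):
    return sum(values) / len(values)

# Toy population (hypothetical): income rises with age, plus Gaussian noise.
rng = random.Random(0)
rows = []
for _ in range(20000):
    age = rng.uniform(20, 70)
    income = 1000 + 50 * age + rng.gauss(0, 500)
    rows.append((age, income))

# Each function returns the probability that the income value is missing.
def p_mcar(age, income):
    return 0.3                               # independent of all data

def p_mar(age, income):
    return 0.6 if age > 45 else 0.1          # depends on observed age only

def p_mnar(age, income):
    return 0.6 if income > 3250 else 0.1     # depends on income itself

def observed_mean(rows, p_missing, rng):
    """Mean income over the rows whose income was not masked."""
    kept = [inc for age, inc in rows if rng.random() >= p_missing(age, inc)]
    return mean(kept)

full_mean = mean([inc for _, inc in rows])
mcar_mean = observed_mean(rows, p_mcar, rng)
mar_mean = observed_mean(rows, p_mar, rng)
mnar_mean = observed_mean(rows, p_mnar, rng)

print(f"full data : {full_mean:8.1f}")
print(f"MCAR      : {mcar_mean:8.1f}")
print(f"MAR       : {mar_mean:8.1f}")
print(f"MNAR      : {mnar_mean:8.1f}")
```

In this toy model the complete-case mean should stay close to the full-data mean under MCAR and drift downward under both MAR and MNAR. The difference is that the MAR bias can be corrected using the observed age, for example by reweighting or by modeling income given age, whereas under MNAR the information needed for the correction is itself missing.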


Coupled Compound Poisson Factorization

arXiv.org Machine Learning

We present a general framework, the coupled compound Poisson factorization (CCPF), to capture the missing-data mechanism in extremely sparse data sets by coupling a hierarchical Poisson factorization with an arbitrary data-generating model. We derive a stochastic variational inference algorithm for the resulting model and, as examples of our framework, implement three different data-generating models (a mixture model, linear regression, and factor analysis) to robustly model non-random missing data in the context of clustering, prediction, and matrix factorization. In all three cases, we test our framework against models that ignore the missing-data mechanism on large scale studies with non-random missing data, and we show that explicitly modeling the missing-data mechanism substantially improves the quality of the results, as measured using data log likelihood on a held-out test set.