AITopics | missing-data mechanism

missing-data mechanism

Missing Data in Signal Processing and Machine Learning: Models, Methods and Modern Approaches

Hippert-Ferrer, Alexandre, Sportisse, Aude, Javaheri, Amirhossein, Korso, Mohammed Nabil El, Palomar, Daniel P.

arXiv.org Machine Learning, Jun-4-2025

Missing data arise when parts of the data are not available for a given variable or a given observation. It is a ubiquitous problem in a wide range of scientific disciplines, including sensor networks, geophysical data analysis, radar and image processing, remote sensing, ecological statistics and biomedical studies, to name just a few [1]-[5]. Signal processing is no exception: there, missing data mainly come from sensor malfunction, hidden or impossible measurements, human errors and natural hazards, all of which can hinder a thorough understanding, analysis, and interpretation of the signal. One of the earliest works on missing data was published in 1932 by Wilks, who mentioned the need to extract as much information as possible from fragmentary answers to questionnaires in social sciences and government statistics. It is therefore not surprising that the first discipline to confront this issue was mathematical statistics. This led Wilks to derive efficient estimators for the parameters of a bivariate normal distribution when the data contain missing values [6]. This work was extended to the multivariate case by Lord in 1955 [7]. Since the early 1970s, the literature on missing data has flourished with the growth of computational capacity, leading to major developments in signal processing and its related fields, such as statistical inference [2], data analysis [8] and machine learning [9]. In particular, Rubin's formulation of a theoretical framework for missing data in [10], which describes the relation between missingness and data values through the so-called missing-data mechanisms, has allowed tremendous advances in statistical analysis.
Therefore, a tutorial paper that summarizes the existing and novel strategies in the signal processing (SP) and machine learning (ML) literature for problems related to missing data, such as parameter estimation, matrix completion, missing-data imputation and learning with missing values, and that shows their potential applications, is urgently needed. This tutorial aims to give practitioners vital tools, in an accessible way, to answer the question: how should missing data be handled? There are many strategies to handle incomplete signals.

data quality, imputation, machine learning, (17 more...)

2506.01696

Genre:

Research Report (1.00)
Instructional Material > Course Syllabus & Notes (1.00)

Technology:

Information Technology > Data Science > Data Quality (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.93)
Information Technology > Artificial Intelligence > Representation & Reasoning > Uncertainty > Bayesian Inference (0.46)

Are labels informative in semi-supervised learning? -- Estimating and leveraging the missing-data mechanism

Sportisse, Aude, Schmutz, Hugo, Humbert, Olivier, Bouveyron, Charles, Mattei, Pierre-Alexandre

arXiv.org Machine Learning, Feb-15-2023

Semi-supervised learning (SSL) is a powerful technique for leveraging unlabeled data to improve machine learning models, but it can be affected by the presence of "informative" labels, which occur when some classes are more likely to be labeled than others. In the missing data literature, such labels are called missing not at random. In this paper, we propose a novel approach to address this issue by estimating the missing-data mechanism and using inverse propensity weighting to debias any SSL algorithm, including those using data augmentation. We also propose a likelihood ratio test to assess whether or not labels are indeed informative. Finally, we demonstrate the performance of the proposed methods on different datasets, in particular on two medical datasets for which we design pseudo-realistic missing data scenarios.
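
As a rough illustration of the inverse propensity weighting idea mentioned above, the sketch below reweights labeled examples by the inverse of their class-conditional labeling probability. It is not the authors' algorithm: the two-class setup, the `propensities` values (treated as known here, whereas the paper estimates the mechanism), and the function names `ipw_weights` and `weighted_class_prior` are all assumptions made for the example.

```python
import numpy as np


def ipw_weights(labels, propensities):
    """Return 1 / P(labeled | class) for each labeled example."""
    return 1.0 / propensities[labels]


def weighted_class_prior(labels, weights, n_classes):
    """Estimate class proportions from labeled examples and per-example weights."""
    totals = np.bincount(labels, weights=weights, minlength=n_classes)
    return totals / totals.sum()


rng = np.random.default_rng(0)
n, n_classes = 200_000, 2
true_prior = np.array([0.7, 0.3])

# Hypothetical labeling mechanism: class 1 is far more likely to be labeled.
# Assumed known here for simplicity.
propensities = np.array([0.05, 0.40])

y = rng.choice(n_classes, size=n, p=true_prior)
is_labeled = rng.random(n) < propensities[y]  # informative (MNAR) labels
y_lab = y[is_labeled]

# Naive estimate: every labeled example counts equally.
naive = weighted_class_prior(y_lab, np.ones(y_lab.size), n_classes)

# IPW estimate: each labeled example counts for 1 / propensity of its class.
debiased = weighted_class_prior(y_lab, ipw_weights(y_lab, propensities), n_classes)

print("naive   :", naive)
print("debiased:", debiased)
```

With these assumed settings, the unweighted estimate over-represents the frequently labeled class, while the weighted one comes back close to `true_prior`. The same weights could in principle multiply a per-example supervised loss instead of a class count.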

artificial intelligence, estimator, machine learning, (19 more...)

2302.0754

Country: Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)

Genre: Research Report > Promising Solution (0.34)

Industry: Health & Medicine > Therapeutic Area (0.46)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Unsupervised or Indirectly Supervised Learning (0.92)
Information Technology > Artificial Intelligence > Machine Learning > Inductive Learning (0.71)

Regression with Missing Data, a Comparison Study of Techniques Based on Random Forests

Gómez-Méndez, Irving, Joly, Emilien

arXiv.org Machine Learning, Oct-18-2021

Random forests and recursive trees are widely used in applied statistics and computer science. The popularity of recursive trees rests on several factors: their easy interpretability, the fact that they can be used for both regression and classification tasks, the small number of hyper-parameters to be tuned and, finally, their non-parametric nature, which allows them to infer arbitrarily complex relations between the input and the output space. A random forest combines several randomized trees, improving prediction accuracy at the cost of a slight loss in interpretability. This technique is easily parallelizable, which has made it one of the most popular tools for handling high-dimensional data sets. It has been successfully applied to various practical problems, including chemoinformatics, ecology, 3D object recognition, bioinformatics and econometrics. Biau and Scornet (2016) present a detailed list of applications as well as a review of random forests. In the present work we focus on the ability of random forests to deal with missing values.

algorithm, missing-data mechanism, random forest, (14 more...)

2110.09333

Country: North America > United States > California > San Francisco County > San Francisco (0.14)

Genre: Research Report (0.82)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Ensemble Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Decision Tree Learning (1.00)

Consequences of Model Misspecification for Maximum Likelihood Estimation with Missing Data

#artificialintelligence, Sep-6-2019, 15:09:37 GMT

Researchers are often faced with the challenge of developing statistical models with incomplete data. Exacerbating this situation is the possibility that either the researcher's complete-data model or the model of the missing-data mechanism is misspecified. In this article, we create a formal theoretical framework for developing statistical models and detecting model misspecification in the presence of incomplete data, where maximum likelihood estimates are obtained by maximizing the observable-data likelihood function when the missing-data mechanism is assumed ignorable. First, we provide sufficient regularity conditions on the researcher's complete-data model to characterize the asymptotic behavior of maximum likelihood estimates in the simultaneous presence of both missing data and model misspecification. These results are then used to derive robust hypothesis testing methods for possibly misspecified models in the presence of data that are Missing at Random (MAR) or Missing Not at Random (MNAR).
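
The role of ignorability can be made concrete with the standard factorization from the missing-data literature. The notation below is an assumption, not taken from the article: $y_{\mathrm{obs}}$ and $y_{\mathrm{mis}}$ are the observed and missing parts of the data, $r$ the missingness indicators, $\theta$ the complete-data model parameters and $\phi$ the mechanism parameters.

```latex
\begin{align}
L(\theta,\phi \mid y_{\mathrm{obs}}, r)
  &= \int p(y_{\mathrm{obs}}, y_{\mathrm{mis}};\theta)\,
          p(r \mid y_{\mathrm{obs}}, y_{\mathrm{mis}};\phi)\,
          \mathrm{d}y_{\mathrm{mis}} \\
  &= p(r \mid y_{\mathrm{obs}};\phi)
     \int p(y_{\mathrm{obs}}, y_{\mathrm{mis}};\theta)\,
          \mathrm{d}y_{\mathrm{mis}}
     \qquad \text{(under MAR)} \\
  &= p(r \mid y_{\mathrm{obs}};\phi)\; p(y_{\mathrm{obs}};\theta).
\end{align}
```

When $\theta$ and $\phi$ are also distinct, maximizing $p(y_{\mathrm{obs}};\theta)$ alone yields the same $\hat{\theta}$ as maximizing the full likelihood, which is the sense in which the mechanism is ignorable. Under MNAR the second step does not hold, so the mechanism term cannot be dropped.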

data quality, machine learning, maximum likelihood estimation, (14 more...)

Technology:

Information Technology > Data Science > Data Quality (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning > Uncertainty > Bayesian Inference (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models > Directed Networks > Bayesian Learning (1.00)

Imputation and low-rank estimation with Missing Non At Random data

Sportisse, Aude, Boyer, Claire, Josse, Julie

arXiv.org Machine Learning, Jan-7-2019

Preprint, January 8, 2019. [...] the use of the Expectation-Maximization (EM) algorithm [8], which yields maximum likelihood estimators in various incomplete-data problems [21]. The theoretical guarantees of these methods, which ensure the correct prediction of missing values or the correct estimation of some parameters of interest, are only valid under assumptions on how the data came to be missing. Rubin [31] introduced three types of missing-data mechanisms: (i) the restrictive assumption of missing completely at random (MCAR) data, where missingness is unrelated to any data values, (ii) missing at random (MAR) data, where the missingness may depend only on the observed variables, and (iii) the more general assumption of missing not at random (MNAR) data, i.e. when the unavailability of the data depends on the values of other variables and on its own value. A classic example of MNAR data, which is the focus of the paper, is a survey in which wealthy people are less willing to disclose their income, or in which people are less inclined to answer sensitive questions about their addictions. Another example is the diagnosis of Alzheimer's disease, which can be made using a score obtained by the patient on a specific test. However, a patient who has the disease has difficulty answering questions and is more likely to abandon the test before it ends.
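
The three mechanisms can be compared on simulated data. The sketch below is only an illustration under assumed settings: the variables `age` and `income`, the logistic missingness probabilities, and the helper `regression_adjusted_mean` are hypothetical and not taken from the paper. It shows that the mean of the observed incomes is roughly unbiased under MCAR, that a regression on the fully observed covariate can correct it under MAR, and that the same correction does not suffice under MNAR.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# Hypothetical data: `age` is always observed, `income` may be missing.
# Both are standardized, so the true mean of `income` is 0.
age = rng.normal(0.0, 1.0, n)
income = 0.5 * age + rng.normal(0.0, 1.0, n)


def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))


# Each mask is True where `income` is missing.
masks = {
    # MCAR: missingness is independent of all data values.
    "MCAR": rng.random(n) < 0.3,
    # MAR: missingness depends only on the observed covariate `age`.
    "MAR": rng.random(n) < sigmoid(2.0 * age - 1.0),
    # MNAR: missingness depends on the value of `income` itself.
    "MNAR": rng.random(n) < sigmoid(2.0 * income - 1.0),
}


def regression_adjusted_mean(mask):
    """Fit income ~ age on complete cases, then average predictions over all ages."""
    slope, intercept = np.polyfit(age[~mask], income[~mask], 1)
    return float(np.mean(slope * age + intercept))


# Mean of the observed incomes only (complete-case estimate).
observed_means = {name: float(income[~m].mean()) for name, m in masks.items()}
# Estimate that exploits the observed covariate.
adjusted_means = {name: regression_adjusted_mean(m) for name, m in masks.items()}

for name in masks:
    print(f"{name}: complete-case mean = {observed_means[name]:+.3f}, "
          f"regression-adjusted mean = {adjusted_means[name]:+.3f}")
```

In this setup high incomes are the ones most likely to be missing under MNAR, mirroring the survey example above, so both estimates stay below the true mean unless the mechanism itself is modeled.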

algorithm, mechanism, softimpute, (15 more...)

1812.11409

Country: Europe > France > Île-de-France > Paris > Paris (0.04)

Genre:

Research Report > New Finding (0.47)
Research Report > Experimental Study (0.47)

Industry: Health & Medicine > Therapeutic Area > Neurology > Alzheimer's Disease (0.88)

Technology:

Information Technology > Data Science (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning > Uncertainty > Bayesian Inference (0.34)
Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models > Directed Networks > Bayesian Learning (0.34)

Coupled Compound Poisson Factorization

Basbug, Mehmet E., Engelhardt, Barbara E.

arXiv.org Machine Learning, Jan-8-2017

We present a general framework, the coupled compound Poisson factorization (CCPF), to capture the missing-data mechanism in extremely sparse data sets by coupling a hierarchical Poisson factorization with an arbitrary data-generating model. We derive a stochastic variational inference algorithm for the resulting model and, as examples of our framework, implement three different data-generating models (a mixture model, linear regression, and factor analysis) to robustly model non-random missing data in the context of clustering, prediction, and matrix factorization. In all three cases, we test our framework against models that ignore the missing-data mechanism on large-scale studies with non-random missing data, and we show that explicitly modeling the missing-data mechanism substantially improves the quality of the results, as measured using data log likelihood on a held-out test set.

artificial intelligence, bayesian inference, machine learning, (17 more...)

1701.02058

Genre: Research Report (0.40)

Technology:

Information Technology > Data Science (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Regression (0.51)
Information Technology > Artificial Intelligence > Representation & Reasoning > Uncertainty > Bayesian Inference (0.46)
