AITopics

Country: North America > United States > California (0.28)

Genre:

Research Report > New Finding (1.00)
Research Report > Experimental Study (1.00)

Industry: Health & Medicine (0.67)

Technology:

Information Technology > Data Science > Data Quality (1.00)
Information Technology > Data Science > Data Mining (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
(3 more...)

Neural Information Processing Systems, Jun-15-2026, 10:41:03 GMT

Unveiling Extraneous Sampling Bias with Data Missing-Not-At-Random

Selection bias poses a widely recognized challenge for unbiased evaluation and learning in many industrial scenarios. For example, in recommender systems, it arises from users' selective interactions with items. Recently, the doubly robust (DR) estimator and its variants have been widely studied to achieve debiased learning of prediction models. However, all of them consider a simple exact matching scenario, i.e., the units (such as user-item pairs in a recommender system) are the same in the training and test sets. In practice, there may be limited or even no overlap in units between the training and test sets. In this paper, we consider a more practical scenario: the joint distribution of the feature and rating is the same in the training and test sets. Theoretical analysis shows that the previous DR estimator is biased in this scenario even if the imputed errors and learned propensities are correct. In addition, we propose a novel super-population doubly robust estimator (SuperDR), which achieves more accurate estimation and a desirable generalization error bound compared to existing DR estimators, and we extend the joint learning algorithm for training the prediction and imputation models. We conduct extensive experiments on three real-world datasets, including a large-scale industrial dataset, to show the effectiveness of our method.
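For orientation, the conventional DR estimator of the average prediction error in the exact matching setting can be sketched as below. This is the standard form (imputed error plus a propensity-weighted correction on observed units), not the SuperDR estimator, whose form the abstract does not give. The function name `dr_estimate` and the toy numbers are illustrative assumptions.

```python
from statistics import fmean


def dr_estimate(errors, observed, propensities, imputed):
    """Conventional doubly robust estimate of the average prediction error.

    errors[i]       true prediction error, only read where observed[i] == 1
    observed[i]     1 if the unit's rating was observed, else 0
    propensities[i] estimated probability that unit i is observed
    imputed[i]      imputed prediction error for unit i
    """
    terms = []
    for e, o, p, e_hat in zip(errors, observed, propensities, imputed):
        # The correction term is applied only to observed units.
        correction = (e - e_hat) / p if o else 0.0
        terms.append(e_hat + correction)
    return fmean(terms)


# Toy check (hypothetical numbers): when the imputed errors are exact,
# the estimate equals the true mean error even with wrong propensities.
true_errors = [0.2, 0.8, 0.5, 0.1]
estimate = dr_estimate(
    errors=[0.2, None, 0.5, None],      # unobserved errors are unknown
    observed=[1, 0, 1, 0],
    propensities=[0.9, 0.3, 0.4, 0.7],  # deliberately arbitrary
    imputed=true_errors,
)
```

The estimate is taken over a fixed set of units, which is the exact matching assumption the abstract argues is too restrictive.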

artificial intelligence, dr estimator, machine learning, (17 more...)

Country: Asia (0.28)

Genre:

Research Report > Experimental Study (1.00)
Research Report > New Finding (0.67)

Technology:

Information Technology > Artificial Intelligence > Machine Learning (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning > Personal Assistant Systems (0.86)

Zhao, Huali, Wang, Tianying

Augmented transfer regression learning for completely missing covariates

arXiv.org Machine Learning, May-7-2026

Large-scale population-level datasets, such as the UK Biobank and the All of Us Research Program, often lack covariates needed for a specific analysis, such as genetic or lifestyle measures, while related studies measure them. This creates a cross-population missing data problem in which covariates are completely unobserved in the target population, rather than partially missing within one dataset. We propose an augmented transfer regression learning method for this setting. The key identifying condition is a sub-population shift assumption: the joint distribution of the outcome and observed covariates may differ across source and target populations, but the conditional distribution of the missing covariates given observed variables is invariant. We combine importance-weighted estimating equations with imputation terms for first- and second-order moments of the missing covariates. The resulting estimator is doubly robust, remaining consistent if either the density ratio model or both imputation models are correctly specified. It is $n^{1/2}$-consistent and asymptotically normal, and attains the semiparametric efficiency bound when both nuisance models are correctly specified.

artificial intelligence, machine learning, target population, (18 more...)

2605.04469

Country: North America > United States (0.28)

Genre: Research Report > Experimental Study (0.69)

Industry:

Health & Medicine > Consumer Health (0.93)
Health & Medicine > Therapeutic Area > Oncology (0.93)

Technology: Information Technology > Artificial Intelligence > Machine Learning (1.00)

arXiv.org Machine Learning, May-6-2026

Predicting missing values: A good idea?

van Buuren, Stef

Minimizing the Mean Squared Error (MSE) is a key objective in machine learning and is commonly used for imputing missing values. While this approach provides accurate point estimates, it introduces systematic biases in downstream analyses. These biases affect key parameters such as variance, prevalence, correlation, slope, and explained variance. The root cause is that imputed values optimized for MSE are averages, which reduce the natural variability in the data. This paper demonstrates that adding noise to imputed values can effectively eliminate these biases. The required noise level is proportional to the MSE. Using a toy example in a multivariate normal setting, we compare two methods: predictive imputation, which minimizes MSE, and stochastic imputation, which incorporates random noise. Simulation results show that predictive methods systematically introduce bias, while stochastic methods preserve the data's natural variability and produce unbiased estimates. We also evaluate three popular imputation tools -- missForest, softImpute, and mice -- and observe consistent biases in predictive methods. These findings highlight that MSE is an inadequate measure of imputation quality, as it prioritizes accuracy over variability. Incorporating noise into imputation methods is essential to prevent biases and ensure valid downstream analyses, underscoring the importance of stochastic approaches for handling incomplete data.
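The contrast between predictive and stochastic imputation can be illustrated with a small simulation. The sketch below is not the paper's experiment: it assumes a bivariate normal toy setting with unit outcome variance, values missing completely at random, and a simple linear imputation model. All names (`fit_line`, `simulate`) are hypothetical. The added noise has variance equal to the residual MSE of the fitted line.

```python
import random
from statistics import fmean, variance


def fit_line(xs, ys):
    """Ordinary least squares fit of y = intercept + slope * x."""
    mx, my = fmean(xs), fmean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return my - slope * mx, slope


def simulate(n=4000, seed=0):
    rng = random.Random(seed)
    xs = [rng.gauss(0, 1) for _ in range(n)]
    # y has variance 0.25 + 0.75 = 1 by construction.
    ys = [0.5 * x + rng.gauss(0, 0.75 ** 0.5) for x in xs]
    missing = [rng.random() < 0.5 for _ in range(n)]

    # Fit the imputation model on the complete cases only.
    obs_x = [x for x, m in zip(xs, missing) if not m]
    obs_y = [y for y, m in zip(ys, missing) if not m]
    a, b = fit_line(obs_x, obs_y)
    mse = fmean([(y - (a + b * x)) ** 2 for x, y in zip(obs_x, obs_y)])

    # Predictive imputation: fill in the conditional mean.
    predictive = [a + b * x if m else y
                  for x, y, m in zip(xs, ys, missing)]
    # Stochastic imputation: conditional mean plus noise with variance = MSE.
    stochastic = [a + b * x + rng.gauss(0, mse ** 0.5) if m else y
                  for x, y, m in zip(xs, ys, missing)]
    return variance(ys), variance(predictive), variance(stochastic)


true_var, pred_var, stoch_var = simulate()
```

In this setup the variance of the predictively imputed outcome falls well below the true variance, while the stochastically imputed outcome stays close to it.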

artificial intelligence, machine learning, mechanism, (18 more...)

2605.03733

Genre: Research Report (0.69)

Technology: Information Technology > Artificial Intelligence > Machine Learning (1.00)

Weill, Nathan, Wang, Kaizheng

Pseudo-Labeling for Unsupervised Domain Adaptation with Kernel GLMs

arXiv.org Machine Learning, Mar-24-2026

We propose a principled framework for unsupervised domain adaptation under covariate shift in kernel Generalized Linear Models (GLMs), encompassing kernelized linear, logistic, and Poisson regression with ridge regularization. Our goal is to minimize prediction error in the target domain by leveraging labeled source data and unlabeled target data, despite differences in covariate distributions. We partition the labeled source data into two batches: one for training a family of candidate models, and the other for building an imputation model. This imputation model generates pseudo-labels for the target data, enabling robust model selection. We establish non-asymptotic excess-risk bounds that characterize adaptation performance through an "effective labeled sample size", explicitly accounting for the unknown covariate shift. Experiments on synthetic and real datasets demonstrate consistent performance gains over source-only baselines.
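The split-and-pseudo-label procedure can be sketched in a few lines. The sketch below swaps the paper's kernel GLMs for a one-dimensional ridge regression through the origin to stay short; the function names, the choice of an unregularized imputation model, and the even split are all assumptions made for illustration.

```python
from statistics import fmean


def ridge_slope(xs, ys, lam):
    """1-D ridge regression through the origin: argmin sum (y - w x)^2 + lam w^2."""
    return sum(x * y for x, y in zip(xs, ys)) / (sum(x * x for x in xs) + lam)


def select_by_pseudo_labels(source_x, source_y, target_x, lambdas):
    # Split the labeled source data into two batches.
    half = len(source_x) // 2
    train_x, train_y = source_x[:half], source_y[:half]
    imp_x, imp_y = source_x[half:], source_y[half:]

    # Batch 1: train one candidate model per regularization level.
    candidates = {lam: ridge_slope(train_x, train_y, lam) for lam in lambdas}
    # Batch 2: build the imputation model (unregularized here, an assumption).
    imputer = ridge_slope(imp_x, imp_y, 0.0)

    # Pseudo-label the unlabeled target covariates.
    pseudo = [imputer * x for x in target_x]

    def target_risk(slope):
        return fmean([(slope * x - p) ** 2 for x, p in zip(target_x, pseudo)])

    # Pick the candidate with the smallest pseudo-labeled target risk.
    best = min(candidates, key=lambda lam: target_risk(candidates[lam]))
    return best, candidates[best]


# Noiseless toy data (y = 2x); target covariates lie outside the source range.
best_lam, best_slope = select_by_pseudo_labels(
    source_x=[1.0, 2.0, 3.0, 4.0],
    source_y=[2.0, 4.0, 6.0, 8.0],
    target_x=[5.0, 6.0, 7.0],
    lambdas=[0.0, 100.0],
)
```

Model selection is scored on the target covariates rather than on held-out source data, which is how the covariate shift enters the choice.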

artificial intelligence, machine learning, probability, (16 more...)

2603.19422

Country:

North America > United States > New York (0.04)
Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)

Genre: Research Report (1.00)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (1.00)

Neural Information Processing Systems, Feb-18-2026, 18:51:30 GMT

Unsupervised Anomaly Detection in The Presence of Missing Values

In this work, we first construct and evaluate a straightforward strategy, "impute-then-detect", by combining state-of-the-art imputation methods with unsupervised anomaly detection methods, where the training data are composed of normal samples only.

artificial intelligence, data mining, machine learning, (19 more...)

Country:

North America > United States (0.14)
Asia > China > Guangdong Province > Shenzhen (0.04)
Asia > China > Hong Kong (0.04)
Oceania > Australia > Victoria > Melbourne (0.04)

Genre: Research Report > Experimental Study (0.93)

Industry: Health & Medicine > Therapeutic Area (0.67)

Technology:

Information Technology > Data Science > Data Mining > Anomaly Detection (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Performance Analysis > Accuracy (0.67)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.46)

Neural Information Processing Systems, Feb-16-2026, 11:20:22 GMT

Removing Hidden Confounding in Recommendation: A Unified Multi-Task Learning Approach

In recommender systems, the collected data used for training is always subject to selection bias, which poses a great challenge for unbiased learning.

artificial intelligence, machine learning, recommendation, (18 more...)

Country:

Asia > China > Beijing > Beijing (0.04)
North America > United States > Florida > Palm Beach County > Boca Raton (0.04)
North America > United States > California > San Diego County > San Diego (0.04)
(2 more...)

Genre: Research Report > Experimental Study (0.93)

Technology:

Information Technology > Artificial Intelligence > Machine Learning (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning > Personal Assistant Systems (0.88)

arXiv.org Machine Learning, Jan-13-2026

Multi-environment Invariance Learning with Missing Data

Jia, Yiran

Learning models that can handle distribution shifts is a key challenge in domain generalization. Invariance learning, an approach that focuses on identifying features invariant across environments, improves model generalization by capturing stable relationships, which may represent causal effects when the data distribution is encoded within a structural equation model (SEM) and satisfies modularity conditions. This has led to a growing body of work that builds on invariance learning, leveraging the inherent heterogeneity across environments to develop methods that provide causal explanations while enhancing robust prediction. However, in many practical scenarios, obtaining complete outcome data from each environment is challenging due to the high cost or complexity of data collection. This limitation in available data hinders the development of models that fully leverage environmental heterogeneity, making it crucial to address missing outcomes to improve both causal insights and robust prediction. In this work, we derive an estimator from the invariance objective under missing outcomes. We establish non-asymptotic guarantees on the variable selection property and on $\ell_2$ error convergence rates, which are influenced by the proportion of missing data and the quality of imputation models across environments. We evaluate the performance of the new estimator through extensive simulations and demonstrate its application using the UCI Bike Sharing dataset to predict the count of bike rentals. The results show that, despite relying on a biased imputation model, the estimator is efficient and achieves lower prediction error, provided the bias is within a reasonable range.

artificial intelligence, imputation, machine learning, (20 more...)

2601.07247

Genre: Research Report > New Finding (0.65)

Technology:

Information Technology > Data Science (1.00)
Information Technology > Modeling & Simulation (0.65)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (0.46)

Nugraha, Agung, Im, Heungjun, Lee, Jihwan

Partial Inverse Design of High-Performance Concrete Using Cooperative Neural Networks for Constraint-Aware Mix Generation

arXiv.org Artificial Intelligence, Dec-11-2025

High-performance concrete requires complex mix design decisions involving interdependent variables and practical constraints. While data-driven methods have improved predictive modeling for forward design in concrete engineering, inverse design remains limited, especially when some variables are fixed and only the remaining ones must be inferred. This study proposes a cooperative neural network framework for the partial inverse design of high-performance concrete. The framework integrates an imputation model with a surrogate strength predictor and learns through cooperative training. Once trained, it generates valid and performance-consistent mix designs in a single forward pass without retraining for different constraint scenarios. Compared with baseline models, including autoencoder models and Bayesian inference with Gaussian process surrogates, the proposed method achieves R-squared values of 0.87 to 0.92 and substantially reduces mean squared error by approximately 50% and 70%, respectively. The results show that the framework provides an accurate and computationally efficient foundation for constraint-aware, data-driven mix proportioning.

artificial intelligence, bayesian inference, machine learning, (17 more...)

2512.06813

Genre: Research Report > New Finding (0.88)

Industry:

Materials > Construction Materials (1.00)
Construction & Engineering (1.00)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning > Uncertainty > Bayesian Inference (0.69)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Regression (0.46)

Yang, Yanjiao, Suen, Daniel, Chen, Yen-Chi

Masking criteria for selecting an imputation model

arXiv.org Machine Learning, Nov-14-2025

Missing data is a common problem across various scientific disciplines, including medical research (Bell et al., 2014), social sciences (Molenberghs et al., 2014), and astronomy (Ivezić et al., 2020). To handle missing entries in the dataset, imputation (Grzesiak et al., 2025; Kim and Shao, 2021; Little and Rubin, 2019) is a popular approach that is widely accepted in practice. An imputation model generates plausible values for each missing entry, transforming an incomplete dataset into a complete one. The critical importance of this task has led to the development of a wide array of imputation models, grounded in various modeling assumptions. These range from traditional approaches like hot-deck imputation (Little and Rubin, 2019) to more sophisticated methods such as Multiple Imputation via Chained Equations (MICE; Van Buuren and Groothuis-Oudshoorn 2011), random forest imputation (Stekhoven and Bühlmann, 2012), techniques based on Markov assumptions on graphs (Yang and Chen, 2025), and even generative adversarial networks (Yoon et al., 2018). Despite the proliferation of imputation models, the selection of an optimal imputation model for a given dataset remains a significant challenge, largely due to the unsupervised nature of the problem. Among the many proposed strategies for evaluating and selecting imputation models, masking has emerged as a particularly popular procedure (Gelman et al., 1998; Honaker et al., 2011; Leek et al., 2012; Qian et al., 2024; Troyanskaya et al., 2001; Wang et al., 2024). Masking involves intentionally creating missing values in observed entries to create a setting where imputation accuracy can be measured against a known ground truth. This approach has demonstrated remarkable success and power in other domains, notably in language modeling (Devlin et al., 2019; Yang et al., 2019), image recognition (Hondru et al., 2025; Vincent et al., 2010; Xie et al., 2022), and prediction-powered inference (Angelopoulos et al., 2023; Wang et al., 2020).
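A minimal sketch of the masking procedure described above, for a single numeric column with `None` marking missing entries. The squared-error criterion, the two candidate imputers, and all names are illustrative assumptions rather than the criteria studied in the paper.

```python
import random
from statistics import fmean, median


def mean_imputer(values):
    fill = fmean([v for v in values if v is not None])
    return [fill if v is None else v for v in values]


def median_imputer(values):
    fill = median([v for v in values if v is not None])
    return [fill if v is None else v for v in values]


def masking_score(values, imputer, mask_fraction=0.2, seed=0):
    """Mean squared error of `imputer` on artificially masked observed entries."""
    rng = random.Random(seed)
    observed_idx = [i for i, v in enumerate(values) if v is not None]
    k = max(1, int(mask_fraction * len(observed_idx)))
    masked_idx = rng.sample(observed_idx, k)

    # Hide a random subset of observed entries; their true values are known.
    masked = list(values)
    for i in masked_idx:
        masked[i] = None

    completed = imputer(masked)
    return fmean([(completed[i] - values[i]) ** 2 for i in masked_idx])


# Hypothetical column with two genuinely missing entries and one outlier.
column = [1.0, 2.0, None, 2.5, 3.0, 40.0, 1.5, None, 2.2, 2.8, 1.9, 3.1]
scores = {
    "mean": masking_score(column, mean_imputer),
    "median": masking_score(column, median_imputer),
}
selected = min(scores, key=scores.get)
```

Using the same seed for every candidate means all imputers are scored on the same masked entries, so the comparison is paired.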

imputation model, machine learning, pattern recognition, (18 more...)

2511.10048

Genre: Research Report (0.64)

Industry: Health & Medicine (1.00)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (0.68)
Information Technology > Artificial Intelligence > Representation & Reasoning > Uncertainty (0.67)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.66)
Information Technology > Artificial Intelligence > Machine Learning > Pattern Recognition (0.47)