 imputation model


Augmented transfer regression learning for completely missing covariates

arXiv.org Machine Learning

Large-scale population-level datasets, such as the UK Biobank and the All of Us Research Program, often lack covariates needed for a specific analysis, such as genetic or lifestyle measures, while related studies measure them. This creates a cross-population missing data problem in which covariates are completely unobserved in the target population, rather than partially missing within one dataset. We propose an augmented transfer regression learning method for this setting. The key identifying condition is a sub-population shift assumption: the joint distribution of the outcome and observed covariates may differ across source and target populations, but the conditional distribution of the missing covariates given observed variables is invariant. We combine importance-weighted estimating equations with imputation terms for first- and second-order moments of the missing covariates. The resulting estimator is doubly robust, remaining consistent if either the density ratio model or both imputation models are correctly specified. It is $n^{1/2}$-consistent and asymptotically normal, and attains the semiparametric efficiency bound when both nuisance models are correctly specified.


Predicting missing values: A good idea?

arXiv.org Machine Learning

Minimizing the Mean Squared Error (MSE) is a key objective in machine learning and is commonly used for imputing missing values. While this approach provides accurate point estimates, it introduces systematic biases in downstream analyses. These biases affect key parameters such as variance, prevalence, correlation, slope, and explained variance. The root cause is that imputed values optimized for MSE are averages, which reduce the natural variability in the data. This paper demonstrates that adding noise to imputed values can effectively eliminate these biases. The required noise level is proportional to the MSE. Using a toy example in a multivariate normal setting, we compare two methods: predictive imputation, which minimizes MSE, and stochastic imputation, which incorporates random noise. Simulation results show that predictive methods systematically introduce bias, while stochastic methods preserve the data's natural variability and produce unbiased estimates. We also evaluate three popular imputation tools -- missForest, softImpute, and mice -- and observe consistent biases in predictive methods. These findings highlight that MSE is an inadequate measure of imputation quality, as it prioritizes accuracy over variability. Incorporating noise into imputation methods is essential to prevent biases and ensure valid downstream analyses, underscoring the importance of stochastic approaches for handling incomplete data.
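The contrast between predictive and stochastic imputation can be illustrated with a minimal sketch. It is not the paper's simulation code: the bivariate normal setup, the 50% missing-completely-at-random rate, the sample size, and all variable names are assumptions chosen for illustration. The noise variance is set to the residual MSE of the fitted regression, in line with the abstract's statement that the required noise level is proportional to the MSE.

```python
import numpy as np

rng = np.random.default_rng(0)
n, rho = 20_000, 0.6  # assumed sample size and correlation

# Bivariate normal toy data with unit variances and correlation rho.
x = rng.standard_normal(n)
y = rho * x + np.sqrt(1 - rho**2) * rng.standard_normal(n)

# Delete about half of y completely at random (an assumption of this sketch).
missing = rng.random(n) < 0.5
obs = ~missing

# Fit a linear regression of y on x using the observed rows only.
slope, intercept = np.polyfit(x[obs], y[obs], deg=1)
pred = intercept + slope * x
resid_mse = np.mean((y[obs] - pred[obs]) ** 2)

# Predictive imputation: fill in the fitted conditional mean.
y_pred = np.where(missing, pred, y)

# Stochastic imputation: conditional mean plus noise whose variance
# equals the residual MSE of the regression.
noise = rng.normal(0.0, np.sqrt(resid_mse), size=n)
y_stoch = np.where(missing, pred + noise, y)

var_pred, var_stoch = y_pred.var(), y_stoch.var()
corr_pred = np.corrcoef(x, y_pred)[0, 1]
corr_stoch = np.corrcoef(x, y_stoch)[0, 1]

print(f"variance:    predictive={var_pred:.3f}  stochastic={var_stoch:.3f}")
print(f"correlation: predictive={corr_pred:.3f}  stochastic={corr_stoch:.3f}")
```

In this toy setting the predictive fill-in shrinks the variance of the completed variable and inflates its correlation with `x`, while the stochastic version stays close to the true values of 1 and `rho`.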


Pseudo-Labeling for Unsupervised Domain Adaptation with Kernel GLMs

arXiv.org Machine Learning

We propose a principled framework for unsupervised domain adaptation under covariate shift in kernel Generalized Linear Models (GLMs), encompassing kernelized linear, logistic, and Poisson regression with ridge regularization. Our goal is to minimize prediction error in the target domain by leveraging labeled source data and unlabeled target data, despite differences in covariate distributions. We partition the labeled source data into two batches: one for training a family of candidate models, and the other for building an imputation model. This imputation model generates pseudo-labels for the target data, enabling robust model selection. We establish non-asymptotic excess-risk bounds that characterize adaptation performance through an "effective labeled sample size", explicitly accounting for the unknown covariate shift. Experiments on synthetic and real datasets demonstrate consistent performance gains over source-only baselines.
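The split-and-pseudo-label idea can be sketched for the kernel ridge regression case as follows. This is a simplified illustration, not the paper's algorithm: the helper names `rbf_kernel` and `fit_krr`, the Gaussian kernel, the candidate penalties, the penalty used for the imputation model, and the synthetic data are all assumptions.

```python
import numpy as np

def rbf_kernel(a, b, gamma=1.0):
    """Gaussian kernel matrix between 1-D sample arrays a and b."""
    return np.exp(-gamma * (a[:, None] - b[None, :]) ** 2)

def fit_krr(x, y, lam, gamma=1.0):
    """Kernel ridge regression; returns a prediction function."""
    gram = rbf_kernel(x, x, gamma) + lam * len(x) * np.eye(len(x))
    alpha = np.linalg.solve(gram, y)
    return lambda z: rbf_kernel(z, x, gamma) @ alpha

def true_fn(x):
    return np.sin(2 * x)

rng = np.random.default_rng(0)

# Covariate shift: source and target inputs follow different distributions,
# while the regression function is shared.
x_src = rng.normal(0.0, 1.0, 400)
y_src = true_fn(x_src) + 0.3 * rng.standard_normal(400)
x_tgt = rng.normal(1.5, 0.5, 300)  # unlabeled target inputs

# Split the labeled source data into two batches.
x_a, y_a = x_src[:200], y_src[:200]   # trains the candidate models
x_b, y_b = x_src[200:], y_src[200:]   # trains the imputation model

# Candidate family: one model per ridge penalty (values are assumptions).
lams = [1e-4, 1e-2, 1.0]
candidates = {lam: fit_krr(x_a, y_a, lam) for lam in lams}

# Imputation model with a lightly regularized fit (penalty is an assumption).
imputer = fit_krr(x_b, y_b, lam=1e-3)
pseudo_labels = imputer(x_tgt)

# Select the candidate that best matches the pseudo-labels on the target inputs.
scores = {lam: np.mean((model(x_tgt) - pseudo_labels) ** 2)
          for lam, model in candidates.items()}
best_lam = min(scores, key=scores.get)
print(scores, best_lam)
```

Because the pseudo-labels are evaluated at the target inputs, the selection criterion reflects the target covariate distribution without requiring an explicit density ratio estimate.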


Unsupervised Anomaly Detection in The Presence of Missing Values

Neural Information Processing Systems

In this work, we first construct and evaluate a straightforward strategy, "impute-then-detect", by combining state-of-the-art imputation methods with unsupervised anomaly detection methods, where the training data are composed of normal samples only.



Multi-environment Invariance Learning with Missing Data

arXiv.org Machine Learning

Learning models that can handle distribution shifts is a key challenge in domain generalization. Invariance learning, an approach that focuses on identifying features invariant across environments, improves model generalization by capturing stable relationships, which may represent causal effects when the data distribution is encoded within a structural equation model (SEM) and satisfies modularity conditions. This has led to a growing body of work that builds on invariance learning, leveraging the inherent heterogeneity across environments to develop methods that provide causal explanations while enhancing robust prediction. However, in many practical scenarios, obtaining complete outcome data from each environment is challenging due to the high cost or complexity of data collection. This limitation in available data hinders the development of models that fully leverage environmental heterogeneity, making it crucial to address missing outcomes to improve both causal insights and robust prediction. In this work, we derive an estimator from the invariance objective under missing outcomes. We establish non-asymptotic guarantees on the variable selection property and on $\ell_2$ error convergence rates, which are influenced by the proportion of missing data and the quality of imputation models across environments. We evaluate the performance of the new estimator through extensive simulations and demonstrate its application using the UCI Bike Sharing dataset to predict the count of bike rentals. The results show that despite relying on a biased imputation model, the estimator is efficient and achieves lower prediction error, provided the bias is within a reasonable range.


Partial Inverse Design of High-Performance Concrete Using Cooperative Neural Networks for Constraint-Aware Mix Generation

arXiv.org Artificial Intelligence

High-performance concrete requires complex mix design decisions involving interdependent variables and practical constraints. While data-driven methods have improved predictive modeling for forward design in concrete engineering, inverse design remains limited, especially when some variables are fixed and only the remaining ones must be inferred. This study proposes a cooperative neural network framework for the partial inverse design of high-performance concrete. The framework integrates an imputation model with a surrogate strength predictor and learns through cooperative training. Once trained, it generates valid and performance-consistent mix designs in a single forward pass without retraining for different constraint scenarios. Compared with baseline models, including autoencoder models and Bayesian inference with Gaussian process surrogates, the proposed method achieves R-squared values of 0.87 to 0.92 and substantially reduces mean squared error by approximately 50% and 70%, respectively. The results show that the framework provides an accurate and computationally efficient foundation for constraint-aware, data-driven mix proportioning.


Masking criteria for selecting an imputation model

arXiv.org Machine Learning

Missing data is a common problem across various scientific disciplines, including medical research (Bell et al., 2014), social sciences (Molenberghs et al., 2014), and astronomy (Ivezić et al., 2020). To handle missing entries in the dataset, imputation (Grzesiak et al., 2025; Kim and Shao, 2021; Little and Rubin, 2019) is a popular approach that is widely accepted in practice. An imputation model generates plausible values for each missing entry, transforming an incomplete dataset into a complete one. The critical importance of this task has led to the development of a wide array of imputation models, grounded in various modeling assumptions. These range from traditional approaches like hot-deck imputation (Little and Rubin, 2019) to more sophisticated methods such as Multiple Imputation via Chained Equations (MICE; Van Buuren and Groothuis-Oudshoorn 2011), random forest imputation (Stekhoven and Bühlmann, 2012), techniques based on Markov assumptions on graphs (Yang and Chen, 2025), and even generative adversarial networks (Yoon et al., 2018). Despite the proliferation of imputation models, the selection of an optimal imputation model for a given dataset remains a significant challenge, largely due to the unsupervised nature of the problem. Among the many proposed strategies for evaluating and selecting imputation models, masking has emerged as a particularly popular procedure (Gelman et al., 1998; Honaker et al., 2011; Leek et al., 2012; Qian et al., 2024; Troyanskaya et al., 2001; Wang et al., 2024). Masking involves intentionally creating missing values in observed entries to create a setting where imputation accuracy can be measured against a known ground truth. This approach has demonstrated remarkable success and power in other domains, notably in language modeling (Devlin et al., 2019; Yang et al., 2019), image recognition (Hondru et al., 2025; Vincent et al., 2010; Xie et al., 2022), and prediction-powered inference (Angelopoulos et al., 2023; Wang et al., 2020).
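The basic masking procedure can be sketched as follows. This is a generic illustration and not the specific criteria studied in the paper: the two candidate imputers (`impute_mean` and `impute_low_rank`), the RMSE score, the 20% masking rate, and the synthetic one-factor data are all assumptions.

```python
import numpy as np

def impute_mean(data):
    """Replace NaNs with the column mean of the observed entries."""
    filled = data.copy()
    col_means = np.nanmean(data, axis=0)
    rows, cols = np.where(np.isnan(data))
    filled[rows, cols] = col_means[cols]
    return filled

def impute_low_rank(data, rank=1, n_iter=50):
    """Iteratively refill NaNs from a truncated SVD of the completed matrix."""
    nan_mask = np.isnan(data)
    filled = impute_mean(data)
    for _ in range(n_iter):
        u, s, vt = np.linalg.svd(filled, full_matrices=False)
        approx = (u[:, :rank] * s[:rank]) @ vt[:rank]
        filled[nan_mask] = approx[nan_mask]
    return filled

def masking_score(data, imputer, mask_frac=0.2, seed=0):
    """Hide a random subset of observed entries, impute, and return the RMSE on them."""
    rng = np.random.default_rng(seed)
    observed = ~np.isnan(data)
    hide = observed & (rng.random(data.shape) < mask_frac)
    masked = data.copy()
    masked[hide] = np.nan
    imputed = imputer(masked)
    return np.sqrt(np.mean((imputed[hide] - data[hide]) ** 2))

# Synthetic incomplete data: one latent factor plus noise, 15% entries missing.
rng = np.random.default_rng(1)
factor = rng.standard_normal((500, 1))
loadings = np.array([[1.0, 0.8, -0.6, 0.9]])
data = factor @ loadings + 0.3 * rng.standard_normal((500, 4))
data[rng.random(data.shape) < 0.15] = np.nan

imputers = {"mean": impute_mean, "low_rank": impute_low_rank}
scores = {name: masking_score(data, fn) for name, fn in imputers.items()}
best = min(scores, key=scores.get)
print(scores, best)
```

Using the same seed for every imputer means each candidate is scored on the same hidden entries, so the comparison is not confounded by the choice of mask. Note that a point-accuracy score such as RMSE rewards imputations close to the conditional mean, so other scoring rules may be preferable when the goal is to preserve the data distribution.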


Missing Data Multiple Imputation for Tabular Q-Learning in Online RL

arXiv.org Machine Learning

Missing data in online reinforcement learning (RL) poses challenges compared to missing data in standard tabular data or in offline policy learning. The need to impute and act at each time step means that imputation cannot be put off until enough data exist to produce stable imputation models. It also means future data collection and learning depend on previous imputations. This paper proposes fully online imputation ensembles. We find that maintaining multiple imputation pathways may help balance the need to capture uncertainty under missingness and the need for efficiency in online settings. We consider multiple approaches for incorporating these pathways into learning and action selection. Using a Grid World experiment with various types of missingness, we provide preliminary evidence that multiple imputation pathways may be a useful framework for constructing simple and efficient online missing data RL methods.