 imputation technique


Closing Gaps: An Imputation Analysis of ICU Vital Signs

Turubayev, Alisher, Shopova, Anna, Lange, Fabian, Kamalak, Mahmut, Mattes, Paul, Ayvasky, Victoria, Arnrich, Bert, Pfitzner, Bjarne, van de Water, Robin P.

arXiv.org Artificial Intelligence

As more Intensive Care Unit (ICU) data becomes available, the interest in developing clinical prediction models to improve healthcare protocols increases. However, the lack of data quality still hinders clinical prediction using Machine Learning (ML). Many vital sign measurements, such as heart rate, contain sizeable missing segments, leaving gaps in the data that could negatively impact prediction performance. Previous works have introduced numerous time-series imputation techniques. Nevertheless, more comprehensive work is needed to compare a representative set of methods for imputing ICU vital signs and determine the best practice. In reality, ad-hoc imputation techniques that could decrease prediction accuracy, like zero imputation, are still used. In this work, we compare established imputation techniques to guide researchers in improving the performance of clinical prediction models by selecting the most accurate imputation technique. We introduce an extensible and reusable benchmark with currently 15 imputation and 4 amputation methods, created for benchmarking on major ICU datasets. We hope to provide a comparative basis and facilitate further ML development to bring more models into clinical practice.
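The abstract's concern about zero imputation can be illustrated with a minimal sketch. It is not the authors' benchmark: the synthetic heart-rate series, the function names, and the choice of forward fill as the comparison method are all assumptions made for illustration. Values are removed completely at random (amputation), filled in two ways, and scored only on the positions that were removed.

```python
import math
import random

def ampute_mcar(series, rate, seed=0):
    """Replace roughly `rate` of the values with None, independent of the values.

    Index 0 is always kept so that forward fill has a starting value.
    Returns the gappy copy and the list of masked indices.
    """
    rng = random.Random(seed)
    masked = [i for i in range(1, len(series)) if rng.random() < rate]
    gappy = list(series)
    for i in masked:
        gappy[i] = None
    return gappy, masked

def impute_zero(series):
    """Ad-hoc baseline: every missing value becomes 0."""
    return [0.0 if v is None else v for v in series]

def impute_forward_fill(series):
    """Carry the last observed value forward across each gap."""
    out, last = [], None
    for v in series:
        if v is None:
            v = last
        out.append(v)
        last = v
    return out

def mae(truth, imputed, idx):
    """Mean absolute error restricted to the amputed positions."""
    return sum(abs(truth[i] - imputed[i]) for i in idx) / len(idx)

# Hypothetical heart-rate signal oscillating between 60 and 80 bpm.
heart_rate = [70.0 + 10.0 * math.sin(t / 10.0) for t in range(200)]
gappy, masked = ampute_mcar(heart_rate, rate=0.3)

mae_zero = mae(heart_rate, impute_zero(gappy), masked)
mae_ffill = mae(heart_rate, impute_forward_fill(gappy), masked)
print(f"zero imputation MAE: {mae_zero:.2f}, forward fill MAE: {mae_ffill:.2f}")
```

Because a resting heart rate is never near zero, zero imputation is wrong by roughly the full signal level on every gap, while even a naive forward fill stays close to the true value for short gaps.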


Data as a Lever: A Neighbouring Datasets Perspective on Predictive Multiplicity

Ganesh, Prakhar, Hsu, Hsiang, Farnadi, Golnoosh

arXiv.org Artificial Intelligence

Multiplicity -- the existence of distinct models with comparable performance -- has received growing attention in recent years. While prior work has largely emphasized modelling choices, the critical role of data in shaping multiplicity has been comparatively overlooked. In this work, we introduce a neighbouring datasets framework to examine the most granular case: the impact of a single-data-point difference on multiplicity. Our analysis yields a seemingly counterintuitive finding: neighbouring datasets with greater inter-class distribution overlap exhibit lower multiplicity. This reversal of conventional expectations arises from a shared Rashomon parameter, and we substantiate it with rigorous proofs. Building on this foundation, we extend our framework to two practical domains: active learning and data imputation. For each, we establish natural extensions of the neighbouring datasets perspective, conduct the first systematic study of multiplicity in existing algorithms, and finally, propose novel multiplicity-aware methods, namely, multiplicity-aware data acquisition strategies for active learning and multiplicity-aware data imputation techniques.


Meta-Imputation Balanced (MIB): An Ensemble Approach for Handling Missing Data in Biomedical Machine Learning

Azad, Fatemeh, Bosnić, Zoran, Kukar, Matjaž

arXiv.org Artificial Intelligence

Missing data represents a fundamental challenge in machine learning applications, often reducing model performance and reliability. This problem is particularly acute in fields like bioinformatics and clinical machine learning, where datasets are frequently incomplete due to the nature of both data generation and data collection. While numerous imputation methods exist, from simple statistical techniques to advanced deep learning models, no single method consistently performs well across diverse datasets and missingness mechanisms. This paper proposes a novel Meta-Imputation approach that learns to combine the outputs of multiple base imputers to predict missing values more accurately. By training the proposed method called Meta-Imputation Balanced (MIB) on synthetically masked data with known ground truth, the system learns to predict the most suitable imputed value based on the behavior of each method. We evaluate our method on tabular data under the Missing Completely at Random (MCAR) assumption using both direct metrics, where Mean Absolute Error (MAE) and Root Mean Squared Error (RMSE) are computed between imputed values and their corresponding original ground truth values in the artificially masked positions, and indirect metrics, which measure the RMSE of a target variable predicted by machine learning models trained on the imputed datasets. Across three benchmark datasets, the model achieved the lowest or near-lowest RMSE and delivered stable downstream predictive performance, even when individual imputers varied in performance.
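The general idea of learning a combination of base imputers from synthetically masked data can be sketched as follows. This is a simplified stand-in, not MIB itself: the abstract does not describe the meta-learner, so the single blending weight found by grid search, the two base imputers (column mean and nearest neighbour on a second column), and the synthetic table are all assumptions.

```python
import math
import random

def rmse(truth, pred):
    """Root mean squared error between two equally long sequences."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(truth, pred)) / len(truth))

# Hypothetical two-column table in which y depends on x.
rng = random.Random(0)
table = []
for _ in range(300):
    x = rng.uniform(0.0, 10.0)
    table.append((x, 2.0 * x + rng.gauss(0.0, 1.0)))

# Synthetically mask y in 90 random rows, so the ground truth stays known.
masked = sorted(rng.sample(range(len(table)), 90))
masked_set = set(masked)
observed = [i for i in range(len(table)) if i not in masked_set]

mean_y = sum(table[i][1] for i in observed) / len(observed)

def impute_mean(i):
    """Base imputer 1: mean of the observed y values."""
    return mean_y

def impute_nearest(i):
    """Base imputer 2: y of the observed row whose x is closest."""
    x = table[i][0]
    j = min(observed, key=lambda k: abs(table[k][0] - x))
    return table[j][1]

def blend(w, rows):
    """Weighted combination of the two base imputers."""
    return [w * impute_mean(i) + (1.0 - w) * impute_nearest(i) for i in rows]

# Fit the blending weight on part of the masked rows, keep the rest held out.
fit_rows, held_out = masked[:60], masked[60:]
fit_truth = [table[i][1] for i in fit_rows]
held_truth = [table[i][1] for i in held_out]

grid = [k / 20 for k in range(21)]  # candidate weights 0.0, 0.05, ..., 1.0
best_w = min(grid, key=lambda w: rmse(fit_truth, blend(w, fit_rows)))

fit_rmse_meta = rmse(fit_truth, blend(best_w, fit_rows))
fit_rmse_mean = rmse(fit_truth, blend(1.0, fit_rows))
fit_rmse_nn = rmse(fit_truth, blend(0.0, fit_rows))
held_rmse_meta = rmse(held_truth, blend(best_w, held_out))
print(f"learned weight on mean imputer: {best_w:.2f}")
print(f"held-out RMSE of the blend: {held_rmse_meta:.2f}")
```

On the rows used for fitting, the blend can never be worse than either base imputer, because the grid includes the weights 0 and 1. The held-out rows give the more honest estimate and correspond to the "direct metrics" described in the abstract.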


Evaluating Imputation Techniques for Short-Term Gaps in Heart Rate Data

Gupta, Vaibhav, Maleshkova, Maria

arXiv.org Artificial Intelligence

Recent advances in wearable technology have enabled the continuous monitoring of vital physiological signals, essential for predictive modeling and early detection of extreme physiological events. Among these physiological signals, heart rate (HR) plays a central role, as it is widely used in monitoring and managing cardiovascular conditions and detecting extreme physiological events such as hypoglycemia. However, data from wearable devices often suffer from missing values. To address this issue, recent studies have employed various imputation techniques. Traditionally, the effectiveness of these methods has been evaluated using predictive accuracy metrics such as RMSE, MAPE, and MAE, which assess numerical proximity to the original data. While informative, these metrics fail to capture the complex statistical structure inherent in physiological signals. This study bridges this gap by presenting a comprehensive evaluation of four statistical imputation methods, linear interpolation, K-Nearest Neighbors (KNN), Piecewise Cubic Hermite Interpolating Polynomial (PCHIP), and B-splines, for short-term HR data gaps. We assess their performance using both predictive accuracy metrics and statistical distance measures, including the Cohen Distance Test (CDT) and Jensen-Shannon Distance (JS Distance), applied to HR data from the D1NAMO dataset and the BIG IDEAs Lab Glycemic Variability and Wearable Device dataset. The analysis reveals limitations in existing imputation approaches and the absence of a robust framework for evaluating imputation quality in physiological signals. Finally, this study proposes a foundational framework to develop a composite evaluation metric to assess imputation performance.
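Two of the ingredients named in the abstract, linear interpolation and the Jensen-Shannon distance, can be sketched in a few lines. This is an illustrative sketch, not the study's evaluation code: the function names, the histogram binning, and the toy HR values are assumptions. The distance is computed here as the square root of the base-2 Jensen-Shannon divergence between two histograms, which keeps it in the range 0 to 1.

```python
import math

def interpolate_linear(series):
    """Fill interior gaps (None) on a straight line between the neighbouring
    observed values. Leading and trailing gaps are left as None."""
    out = list(series)
    known = [i for i, v in enumerate(out) if v is not None]
    for left, right in zip(known, known[1:]):
        for i in range(left + 1, right):
            frac = (i - left) / (right - left)
            out[i] = out[left] + frac * (out[right] - out[left])
    return out

def histogram(values, lo, hi, bins):
    """Normalised histogram over [lo, hi]; out-of-range values are clamped."""
    counts = [0] * bins
    for v in values:
        k = int((v - lo) / (hi - lo) * bins)
        counts[max(0, min(k, bins - 1))] += 1
    total = sum(counts)
    return [c / total for c in counts]

def js_distance(p, q):
    """Jensen-Shannon distance between two discrete distributions."""
    m = [(a + b) / 2.0 for a, b in zip(p, q)]

    def kl(u, v):
        # Terms with u == 0 contribute nothing; v > 0 whenever u > 0.
        return sum(a * math.log2(a / b) for a, b in zip(u, v) if a > 0)

    divergence = 0.5 * kl(p, m) + 0.5 * kl(q, m)
    return math.sqrt(max(0.0, divergence))

# Toy HR trace (bpm) with a short gap, and the same trace before the gap.
original = [62.0, 64.0, 67.0, 71.0, 74.0, 76.0, 75.0, 72.0]
gappy = [62.0, 64.0, None, None, 74.0, 76.0, 75.0, 72.0]
filled = interpolate_linear(gappy)

p = histogram(original, lo=60.0, hi=80.0, bins=4)
q = histogram(filled, lo=60.0, hi=80.0, bins=4)
print("filled:", filled)
print("JS distance between original and imputed distributions:", js_distance(p, q))
```

A pointwise metric such as MAE compares values at the same time index, whereas the JS distance compares the overall value distributions, so the two can disagree about which imputation is better.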


MoCap-Impute: A Comprehensive Benchmark and Comparative Analysis of Imputation Methods for IMU-based Motion Capture Data

Bekhit, Mahmoud, Salah, Ahmad, Alrawahi, Ahmed Salim, Attia, Tarek, Ali, Ahmed, Eldesokey, Esraa, Fathalla, Ahmed

arXiv.org Artificial Intelligence

Motion capture (MoCap) data from wearable Inertial Measurement Units (IMUs) is vital for applications in sports science, but its utility is often compromised by missing data. Despite numerous imputation techniques, a systematic performance evaluation for IMU-derived MoCap time-series data is lacking. We address this gap by conducting a comprehensive comparative analysis of statistical, machine learning, and deep learning imputation methods. Our evaluation considers three distinct contexts: univariate time-series, multivariate across subjects, and multivariate across kinematic angles. To facilitate this benchmark, we introduce the first publicly available MoCap dataset designed specifically for imputation, featuring data from 53 karate practitioners. We simulate three controlled missingness mechanisms: missing completely at random (MCAR), block missingness, and a novel value-dependent pattern at signal transition points. Our experiments, conducted on 39 kinematic variables across all subjects, reveal that multivariate imputation frameworks consistently outperform univariate approaches, particularly for complex missingness. For instance, multivariate methods achieve up to a 50% mean absolute error reduction (MAE from 10.8 to 5.8) compared to univariate techniques for transition point missingness. Advanced models like Generative Adversarial Imputation Networks (GAIN) and Iterative Imputers demonstrate the highest accuracy in these challenging scenarios. This work provides a critical baseline for future research and offers practical recommendations for improving the integrity and robustness of MoCap data analysis.
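The difference between two of the missingness patterns named above, MCAR and block missingness, can be made concrete with a small sketch. The function names and parameters are hypothetical, and the paper's value-dependent transition-point pattern is not reproduced because the abstract does not specify it.

```python
import random

def mcar_mask(n, rate, seed=0):
    """Each position is missing independently with probability `rate`."""
    rng = random.Random(seed)
    return [rng.random() < rate for _ in range(n)]

def block_mask(n, block_len, seed=0):
    """One contiguous run of `block_len` missing positions at a random offset."""
    rng = random.Random(seed)
    start = rng.randrange(0, n - block_len + 1)
    return [start <= i < start + block_len for i in range(n)]

def longest_gap(mask):
    """Length of the longest run of consecutive missing positions."""
    best = run = 0
    for is_missing in mask:
        run = run + 1 if is_missing else 0
        best = max(best, run)
    return best

n = 100
scattered = mcar_mask(n, rate=0.2)
contiguous = block_mask(n, block_len=20)
print("MCAR:  missing =", sum(scattered), " longest gap =", longest_gap(scattered))
print("block: missing =", sum(contiguous), " longest gap =", longest_gap(contiguous))
```

With a similar overall missing fraction, MCAR tends to produce many short gaps that have observed neighbours on both sides, while block missingness produces one long gap. A univariate method has little local information to work with inside a long gap, which is one plausible reason cross-variable information matters more in the harder scenarios.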


Filling in the Blanks: Applying Data Imputation in incomplete Water Metering Data

Amaxilatis, Dimitrios, Sarantakos, Themistoklis, Chatzigiannakis, Ioannis, Mylonas, Georgios

arXiv.org Artificial Intelligence

In this work, we explore the application of recent data imputation techniques to enhance monitoring and management of water distribution networks using smart water meters, based on data derived from a real-world IoT water grid monitoring deployment. Despite the detailed data produced by such meters, data gaps due to technical issues can significantly impact operational decisions and efficiency. We compare various imputation methods, such as k-Nearest Neighbors, MissForest, Transformers, and Recurrent Neural Networks, and study their effect on the accuracy and reliability of water metering data in applications like leak detection and predictive maintenance scheduling. Our results indicate that effective data imputation can substantially enhance the quality of the insights derived from water consumption data. In the era of smart cities and advanced utility management, the monitoring of water grids has become increasingly pivotal to ensuring efficient distribution, sustainability, and infrastructure reliability. However, despite the sophistication of these monitoring systems, the occurrence of missing data due to various factors, ranging from technical malfunctions to data transmission errors, remains an open challenge that undermines both the integrity of the datasets produced by such infrastructure and the actionable insights that can be derived from them. Moreover, the significance of addressing missing data extends beyond mere data completeness. In the context of water grid monitoring, it impacts decision-making processes related to water management, leak detection, and predictive maintenance, all of which have profound implications for operational efficiency and environmental sustainability.


Activity and Subject Detection for UCI HAR Dataset with & without missing Sensor Data

Saha, Debashish, Malik, Piyush, Saha, Adrika

arXiv.org Artificial Intelligence

Current studies in Human Activity Recognition (HAR) primarily focus on the classification of activities through sensor data, while there is not much emphasis placed on recognizing the individuals performing these activities. This type of classification is very important for developing personalized and context-sensitive applications. Additionally, the issue of missing sensor data, which often occurs in practical situations due to hardware malfunctions, has not been explored yet. This paper seeks to fill these voids by introducing a lightweight LSTM-based model that can be used to classify both activities and subjects. The proposed model was used to classify the HAR dataset by UCI [1], achieving an accuracy of 93.89% in activity recognition (across six activities), nearing the 96.67% benchmark, and an accuracy of 80.19% in subject recognition (involving 30 subjects), thereby establishing a new baseline for this area of research. We then simulate the absence of sensor data to mirror real-world scenarios and incorporate imputation techniques, both with and without Principal Component Analysis (PCA), to restore incomplete datasets. We found that K-Nearest Neighbors (KNN) imputation performs the best for filling the missing sensor data without PCA because the use of PCA resulted in slightly lower accuracy. These results demonstrate how well the framework handles missing sensor data, which is a major step forward in using the Human Activity Recognition dataset for reliable classification tasks.


Performance of Machine Learning Classifiers for Anomaly Detection in Cyber Security Applications

Haug, Markus, Velarde, Gissel

arXiv.org Artificial Intelligence

This work empirically evaluates machine learning models on two imbalanced public datasets (KDDCUP99 and Credit Card Fraud 2013). The method includes data preparation, model training, and evaluation, using an 80/20 (train/test) split. Models tested include eXtreme Gradient Boosting (XGB), Multi Layer Perceptron (MLP), Generative Adversarial Network (GAN), Variational Autoencoder (VAE), and Multiple-Objective Generative Adversarial Active Learning (MO-GAAL), with XGB and MLP further combined with Random-Over-Sampling (ROS) and Self-Paced-Ensemble (SPE). Evaluation involves 5-fold cross-validation and imputation techniques (mean, median, and IterativeImputer) with 10%, 20%, 30%, and 50% missing data. Findings show XGB and MLP outperform generative models. IterativeImputer results are comparable to mean and median, but not recommended for large datasets due to increased complexity and execution time. The code used is publicly available on GitHub (github.com/markushaug/acr-25).


The influence of missing data mechanisms and simple missing data handling techniques on fairness

Bhatti, Aeysha, Sandrock, Trudie, Nienkemper-Swanepoel, Johane

arXiv.org Machine Learning

Fairness of machine learning algorithms is receiving increasing attention, as such algorithms permeate the day-to-day aspects of our lives. One way in which bias can manifest in a dataset is through missing values. If data are missing, these data are often assumed to be missing completely at random; in reality the propensity of data being missing is often tied to the demographic characteristics of individuals. There is limited research into how missing values and the handling thereof can impact the fairness of an algorithm. Most researchers either apply listwise deletion or tend to use the simpler methods of imputation (e.g. mean or mode) rather than the more advanced ones (e.g. multiple imputation); we therefore study the impact of the simpler methods on the fairness of algorithms. The starting point of the study is the mechanism of missingness, leading into how the missing data are processed and finally how this impacts fairness. Three popular datasets in the field of fairness are amputed in a simulation study. The results show that under certain scenarios the impact on fairness can be pronounced when the missingness mechanism is missing at random. Furthermore, elementary missing data handling techniques like listwise deletion and mode imputation can lead to higher fairness compared to more complex imputation methods like k-nearest neighbour imputation, albeit often at the cost of lower accuracy.
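The point that missingness is often tied to demographic characteristics can be illustrated with a small simulation. It is a sketch under stated assumptions, not the paper's simulation study: the two groups, the missingness probabilities of 0.1 and 0.5, and the income variable are all hypothetical. Because the probability of a missing value depends on the observed group label, the mechanism is missing at random (MAR), and listwise deletion changes the composition of the dataset.

```python
import random

rng = random.Random(0)

# Hypothetical records: (group, income). Income is missing far more often
# for group B than for group A, so missingness depends on an observed column.
records = []
for _ in range(2000):
    group = "A" if rng.random() < 0.5 else "B"
    income = rng.gauss(50.0, 10.0)
    p_missing = 0.1 if group == "A" else 0.5
    if rng.random() < p_missing:
        income = None
    records.append((group, income))

def share_of_b(rows):
    """Fraction of rows belonging to group B."""
    return sum(1 for g, _ in rows if g == "B") / len(rows)

# Listwise deletion: keep only rows with no missing value.
complete_cases = [r for r in records if r[1] is not None]

share_before = share_of_b(records)
share_after = share_of_b(complete_cases)
print(f"group B share before deletion: {share_before:.2f}")
print(f"group B share after deletion:  {share_after:.2f}")
```

Group B ends up underrepresented among the complete cases, so any model trained after listwise deletion sees a shifted population. Whether that raises or lowers a given fairness metric depends on the setting, which is the question the study examines.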


Masking the Gaps: An Imputation-Free Approach to Time Series Modeling with Missing Data

Neog, Abhilash, Daw, Arka, Khorasgani, Sepideh Fatemi, Karpatne, Anuj

arXiv.org Artificial Intelligence

A significant challenge in time-series (TS) modeling is the presence of missing values in real-world TS datasets. Traditional two-stage frameworks, involving imputation followed by modeling, suffer from two key drawbacks: (1) the propagation of imputation errors into subsequent TS modeling, (2) the trade-offs between imputation efficacy and imputation complexity. While one-stage approaches attempt to address these limitations, they often struggle with scalability or fully leveraging partially observed features. To this end, we propose a novel imputation-free approach for handling missing values in time series termed Missing Feature-aware Time Series Modeling (MissTSM) with two main innovations. First, we develop a novel embedding scheme that treats every combination of time-step and feature (or channel) as a distinct token. Second, we introduce a novel Missing Feature-Aware Attention (MFAA) Layer to learn latent representations at every time-step based on partially observed features. We evaluate the effectiveness of MissTSM in handling missing values over multiple benchmark datasets.