ELMV: an Ensemble-Learning Approach for Analyzing Electrical Health Records with Significant Missing Values

Liu, Lucas J., Zhang, Hongwei, Di, Jianzhong, Chen, Jin

arXiv.org Machine Learning 

Real-world Electronic Health Record (EHR) data have played an important role in improving patient care and clinician experience and in providing rich information for biomedical research [1, 2, 3]. However, many EHR datasets contain a significant proportion of missing values, in some cases as high as 50%, so restricting the analysis to individuals with complete data substantially reduces the sample size even in initially large cohorts [4, 5]. On the other hand, leaving a large portion of missing information unaddressed usually causes bias and loss of efficiency, and ultimately leads to inappropriate conclusions [6]. Data imputation algorithms (e.g. the scikit-learn estimators [7]) attempt to replace missing data with meaningful values, including random values, the mean or median of rows or columns, spatial-temporal regressed values, the most frequent values in the same columns, or representative values identified using k-nearest neighbors [8]. Advanced data imputation algorithms, such as Multivariate Imputation by Chained Equations (MICE) [9], have been developed to fill missing values multiple times. Leveraging the power of GPUs and big data, deep neural network models, such as Datawig [10], can produce more accurate estimates than traditional data imputation methods [11].
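To make the trade-off above concrete, the following is a minimal sketch, not the method proposed in this paper, that contrasts complete-case analysis with three of the simple per-column strategies mentioned (mean, median, most frequent value). It uses only the Python standard library; the toy cohort `rows` and the helper names `complete_cases` and `impute_columns` are hypothetical and chosen for illustration only.

```python
from statistics import mean, median, mode

# Hypothetical toy cohort: each row is a patient, each column a measurement.
# None marks a missing value.
rows = [
    [7.0, 1.0, None],
    [None, 2.0, 3.0],
    [5.0, None, 3.0],
    [6.0, 2.0, 4.0],
]


def complete_cases(data):
    """Keep only rows with no missing values (complete-case analysis)."""
    return [row for row in data if all(v is not None for v in row)]


def impute_columns(data, strategy):
    """Fill each missing value with a statistic of the observed values
    in the same column. `strategy` maps a list of numbers to one number."""
    columns = list(zip(*data))
    fills = [strategy([v for v in col if v is not None]) for col in columns]
    return [
        [fills[j] if v is None else v for j, v in enumerate(row)]
        for row in data
    ]


kept = complete_cases(rows)                  # only the last patient survives
mean_filled = impute_columns(rows, mean)     # per-column mean
median_filled = impute_columns(rows, median) # per-column median
mode_filled = impute_columns(rows, mode)     # per-column most frequent value

print(f"{len(kept)} of {len(rows)} rows are complete")
print(mean_filled)
```

In this toy cohort only one of four patients has complete data, while every imputed version keeps all four rows. The single-value fills shown here ignore relationships between columns, which is the limitation that multiple-imputation approaches such as MICE [9] are designed to address.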
