missranger
Which Imputation Fits Which Feature Selection Method? A Survey-Based Simulation Study
Schwerter, Jakob, Romero, Andrés, Dumpert, Florian, Pauly, Markus
Tree-based learning methods such as Random Forest and XGBoost are still the gold-standard prediction methods for tabular data. Feature importance measures are usually considered for feature selection as well as to assess the effect of features on the outcome variables in the model. This also applies to survey data, which are frequently encountered in the social sciences and official statistics. These types of datasets often present the challenge of missing values. The typical solution is to impute the missing data before applying the learning method. However, given the large number of possible imputation methods available, the question arises as to which should be chosen to achieve the 'best' reflection of feature importance and feature selection in subsequent analyses. In the present paper, we investigate this question in a survey-based simulation study for eight state-of-the art imputation methods and three learners. The imputation methods comprise listwise deletion, three MICE options, four \texttt{missRanger} options as well as the recently proposed mixGBoost imputation approach. As learners, we consider the two most common tree-based methods, Random Forest and XGBoost, and an interpretable linear model with regularization.
- Europe > Austria > Vienna (0.14)
- Europe > Germany > North Rhine-Westphalia > Arnsberg Region > Dortmund (0.04)
- Europe > Germany > Hesse > Darmstadt Region > Wiesbaden (0.04)
- (12 more...)
- Government (1.00)
- Education > Educational Setting (0.93)
- Health & Medicine (0.68)
- Information Technology > Artificial Intelligence > Machine Learning > Ensemble Learning (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Performance Analysis > Accuracy (0.51)
- Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Regression (0.46)
Adapting tree-based multiple imputation methods for multi-level data? A simulation study
Gurtskaia, Ketevan, Schwerter, Jakob, Doebler, Philipp
This simulation study evaluates the effectiveness of multiple imputation (MI) techniques for multilevel data. It compares the performance of traditional Multiple Imputation by Chained Equations (MICE) with tree-based methods such as Chained Random Forests with Predictive Mean Matching and Extreme Gradient Boosting. Adapted versions that include dummy variables for cluster membership are also included for the tree-based methods. Methods are evaluated for coefficient estimation bias, statistical power, and type I error rates on simulated hierarchical data with different cluster sizes (25 and 50) and levels of missingness (10\% and 50\%). Coefficients are estimated using random intercept and random slope models. The results show that while MICE is preferred for accurate rejection rates, Extreme Gradient Boosting is advantageous for reducing bias. Furthermore, the study finds that bias levels are similar across different cluster sizes, but rejection rates tend to be less favorable with fewer clusters (lower power, higher type I error). In addition, the inclusion of cluster dummies in tree-based methods improves estimation for Level 1 variables, but is less effective for Level 2 variables. When data become too complex and MICE is too slow, extreme gradient boosting is a good alternative for hierarchical data. Keywords: Multiple imputation; multi-level data; MICE; missRanger; mixgb
- North America > United States > New York (0.04)
- North America > United States > Florida > Palm Beach County > Boca Raton (0.04)
- Europe > Germany > North Rhine-Westphalia > Arnsberg Region > Dortmund (0.04)
- (2 more...)
Evaluating tree-based imputation methods as an alternative to MICE PMM for drawing inference in empirical studies
Schwerter, Jakob, Gurtskaia, Ketevan, Romero, Andrés, Zeyer-Gliozzo, Birgit, Pauly, Markus
Dealing with missing data is an important problem in statistical analysis that is often addressed with imputation procedures. The performance and validity of such methods are of great importance for their application in empirical studies. While the prevailing method of Multiple Imputation by Chained Equations (MICE) with Predictive Mean Matching (PMM) is considered standard in the social science literature, the increase in complex datasets may require more advanced approaches based on machine learning. In particular, tree-based imputation methods have emerged as very competitive approaches. However, the performance and validity are not completely understood, particularly compared to the standard MICE PMM. This is especially true for inference in linear models. In this study, we investigate the impact of various imputation methods on coefficient estimation, Type I error, and power, to gain insights that can help empirical researchers deal with missingness more effectively. We explore MICE PMM alongside different tree-based methods, such as MICE with Random Forest (RF), Chained Random Forests with and without PMM (missRanger), and Extreme Gradient Boosting (MIXGBoost), conducting a realistic simulation study using the German National Educational Panel Study (NEPS) as the original data source. Our results reveal that Random Forest-based imputations, especially MICE RF and missRanger with PMM, consistently perform better in most scenarios. Standard MICE PMM shows partially increased bias and overly conservative test decisions, particularly with non-true zero coefficients. Our results thus underscore the potential advantages of tree-based imputation methods, albeit with a caveat that all methods perform worse with an increased missingness, particularly missRanger.
- Europe > Austria > Vienna (0.14)
- North America > United States > New York (0.04)
- Europe > Netherlands (0.04)
- (7 more...)
- Health & Medicine (0.68)
- Education > Educational Setting > K-12 Education (0.68)
- Information Technology > Artificial Intelligence > Machine Learning > Ensemble Learning (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Decision Tree Learning (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Regression (0.46)
A method for comparing multiple imputation techniques: a case study on the U.S. National COVID Cohort Collaborative
Casiraghi, Elena, Wong, Rachel, Hall, Margaret, Coleman, Ben, Notaro, Marco, Evans, Michael D., Tronieri, Jena S., Blau, Hannah, Laraway, Bryan, Callahan, Tiffany J., Chan, Lauren E., Bramante, Carolyn T., Buse, John B., Moffitt, Richard A., Sturmer, Til, Johnson, Steven G., Shao, Yu Raymond, Reese, Justin, Robinson, Peter N., Paccanaro, Alberto, Valentini, Giorgio, Huling, Jared D., Wilkins, Kenneth, :, null, Bennet, Tell, Chute, Christopher, DeWitt, Peter, Gersing, Kenneth, Girvin, Andrew, Haendel, Melissa, Harper, Jeremy, Hajagos, Janos, Hong, Stephanie, Pfaff, Emily, Reusch, Jane, Antoniescu, Corneliu, Robaski, Kimberly
Healthcare datasets obtained from Electronic Health Records have proven to be extremely useful to assess associations between patients' predictors and outcomes of interest. However, these datasets often suffer from missing values in a high proportion of cases and the simple removal of these cases may introduce severe bias. For these reasons, several multiple imputation algorithms have been proposed to attempt to recover the missing information. Each algorithm presents strengths and weaknesses, and there is currently no consensus on which multiple imputation algorithms works best in a given scenario. Furthermore, the selection of each algorithm parameters and data-related modelling choices are also both crucial and challenging. In this paper, we propose a novel framework to numerically evaluate strategies for handling missing data in the context of statistical analysis, with a particular focus on multiple imputation techniques. We demonstrate the feasibility of our approach on a large cohort of type-2 diabetes patients provided by the National COVID Cohort Collaborative (N3C) Enclave, where we explored the influence of various patient characteristics on outcomes related to COVID-19. Our analysis included classic multiple imputation techniques as well as simple complete-case Inverse Probability Weighted models. The experiments presented here show that our approach could effectively highlight the most valid and performant missing-data handling strategy for our case study. Moreover, our methodology allowed us to gain an understanding of the behavior of the different models and of how it changed as we modified their parameters. Our method is general and can be applied to different research fields and on datasets containing heterogeneous types.
- North America > United States > Minnesota > Hennepin County > Minneapolis (0.28)
- North America > United States > Pennsylvania > Philadelphia County > Philadelphia (0.14)
- North America > United States > Colorado > Denver County > Denver (0.14)
- (43 more...)
- Research Report > Strength High (1.00)
- Research Report > New Finding (1.00)
- Research Report > Experimental Study (1.00)
- Health & Medicine > Therapeutic Area > Oncology (1.00)
- Health & Medicine > Therapeutic Area > Infections and Infectious Diseases (1.00)
- Health & Medicine > Therapeutic Area > Immunology (1.00)
- Health & Medicine > Therapeutic Area > Endocrinology > Diabetes (1.00)