AITopics | missranger

Which Imputation Fits Which Feature Selection Method? A Survey-Based Simulation Study

Schwerter, Jakob, Romero, Andrés, Dumpert, Florian, Pauly, Markus

arXiv.org Machine Learning, Dec-18-2024

Tree-based learning methods such as Random Forest and XGBoost are still the gold-standard prediction methods for tabular data. Feature importance measures are usually considered for feature selection as well as to assess the effect of features on the outcome variables in the model. This also applies to survey data, which are frequently encountered in the social sciences and official statistics. These types of datasets often present the challenge of missing values. The typical solution is to impute the missing data before applying the learning method. However, given the large number of possible imputation methods available, the question arises as to which should be chosen to achieve the 'best' reflection of feature importance and feature selection in subsequent analyses. In the present paper, we investigate this question in a survey-based simulation study for eight state-of-the-art imputation methods and three learners. The imputation methods comprise listwise deletion, three MICE options, four missRanger options as well as the recently proposed mixGBoost imputation approach. As learners, we consider the two most common tree-based methods, Random Forest and XGBoost, and an interpretable linear model with regularization.

artificial intelligence, imputation method, machine learning

2412.1357

Country:

Europe > Austria > Vienna (0.14)
Europe > Germany > North Rhine-Westphalia > Arnsberg Region > Dortmund (0.04)
Europe > Germany > Hesse > Darmstadt Region > Wiesbaden (0.04)

Genre: Research Report > New Finding (1.00)

Industry:

Government (1.00)
Education > Educational Setting (0.93)
Health & Medicine (0.68)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Ensemble Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Performance Analysis > Accuracy (0.51)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Regression (0.46)

Adapting tree-based multiple imputation methods for multi-level data? A simulation study

Gurtskaia, Ketevan, Schwerter, Jakob, Doebler, Philipp

arXiv.org Machine Learning, Jan-25-2024

This simulation study evaluates the effectiveness of multiple imputation (MI) techniques for multilevel data. It compares the performance of traditional Multiple Imputation by Chained Equations (MICE) with tree-based methods such as Chained Random Forests with Predictive Mean Matching and Extreme Gradient Boosting. Adapted versions that include dummy variables for cluster membership are also included for the tree-based methods. Methods are evaluated for coefficient estimation bias, statistical power, and type I error rates on simulated hierarchical data with different cluster sizes (25 and 50) and levels of missingness (10% and 50%). Coefficients are estimated using random intercept and random slope models. The results show that while MICE is preferred for accurate rejection rates, Extreme Gradient Boosting is advantageous for reducing bias. Furthermore, the study finds that bias levels are similar across different cluster sizes, but rejection rates tend to be less favorable with fewer clusters (lower power, higher type I error). In addition, the inclusion of cluster dummies in tree-based methods improves estimation for Level 1 variables, but is less effective for Level 2 variables. When data become too complex and MICE is too slow, extreme gradient boosting is a good alternative for hierarchical data. Keywords: Multiple imputation; multi-level data; MICE; missRanger; mixgb
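
As a rough illustration of the cluster-dummy adaptation mentioned in the abstract above, the sketch below appends one 0/1 indicator column per cluster to each feature row, so that a tree-based imputer (which has no built-in notion of hierarchy) can split on cluster membership. The function name `add_cluster_dummies` and the toy data are hypothetical and are not taken from the paper; the authors' actual preprocessing may differ.

```python
from typing import Hashable, List, Optional, Sequence, Tuple

Row = Sequence[Optional[float]]


def add_cluster_dummies(
    rows: Sequence[Row], clusters: Sequence[Hashable]
) -> Tuple[List[List[Optional[float]]], List[Hashable]]:
    """Append one indicator column per distinct cluster to every row.

    Missing feature values (None) are passed through untouched, because
    the imputation itself happens in a later step.
    Returns the augmented rows and the cluster levels in column order.
    """
    if len(rows) != len(clusters):
        raise ValueError("rows and clusters must have the same length")

    # Fix a stable column order for the dummy variables.
    levels = sorted(set(clusters))
    position = {level: i for i, level in enumerate(levels)}

    augmented = []
    for row, cluster in zip(rows, clusters):
        dummies = [0.0] * len(levels)
        dummies[position[cluster]] = 1.0
        augmented.append(list(row) + dummies)
    return augmented, levels


# Toy Level 1 data: two features per student, one of them partly missing.
features = [[1.0, None], [2.0, 3.0], [4.0, 5.0]]
school = ["s2", "s1", "s2"]  # hypothetical cluster identifiers
augmented_rows, dummy_levels = add_cluster_dummies(features, school)
```

The augmented rows would then be handed to whichever imputation routine is in use; with many clusters this adds many columns, which is one reason such an adaptation is a trade-off rather than a free improvement.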

imputation, imputation method, missingness

2401.14161

Country:

North America > United States > New York (0.04)
North America > United States > Florida > Palm Beach County > Boca Raton (0.04)
Europe > Germany > North Rhine-Westphalia > Arnsberg Region > Dortmund (0.04)

Genre: Research Report > New Finding (0.88)

Industry: Education (0.93)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Ensemble Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Performance Analysis > Accuracy (0.34)

Evaluating tree-based imputation methods as an alternative to MICE PMM for drawing inference in empirical studies

Schwerter, Jakob, Gurtskaia, Ketevan, Romero, Andrés, Zeyer-Gliozzo, Birgit, Pauly, Markus

arXiv.org Machine Learning, Jan-17-2024

Dealing with missing data is an important problem in statistical analysis that is often addressed with imputation procedures. The performance and validity of such methods are of great importance for their application in empirical studies. While the prevailing method of Multiple Imputation by Chained Equations (MICE) with Predictive Mean Matching (PMM) is considered standard in the social science literature, the increase in complex datasets may require more advanced approaches based on machine learning. In particular, tree-based imputation methods have emerged as very competitive approaches. However, the performance and validity are not completely understood, particularly compared to the standard MICE PMM. This is especially true for inference in linear models. In this study, we investigate the impact of various imputation methods on coefficient estimation, Type I error, and power, to gain insights that can help empirical researchers deal with missingness more effectively. We explore MICE PMM alongside different tree-based methods, such as MICE with Random Forest (RF), Chained Random Forests with and without PMM (missRanger), and Extreme Gradient Boosting (MIXGBoost), conducting a realistic simulation study using the German National Educational Panel Study (NEPS) as the original data source. Our results reveal that Random Forest-based imputations, especially MICE RF and missRanger with PMM, consistently perform better in most scenarios. Standard MICE PMM shows partially increased bias and overly conservative test decisions, particularly with non-true zero coefficients. Our results thus underscore the potential advantages of tree-based imputation methods, albeit with the caveat that all methods, particularly missRanger, perform worse with increased missingness.
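
To make the Predictive Mean Matching (PMM) step named in the abstract above more concrete, here is a minimal, simplified sketch. It assumes that some model (a linear regression in MICE PMM, a random forest in missRanger) has already produced predictions for both the observed and the missing cases; each missing value is then replaced by the actually observed value of a randomly chosen donor among the `k` cases with the closest predictions. The function name `pmm_impute` and the toy numbers are hypothetical, and real implementations differ in details such as how donors are matched and drawn.

```python
import random
from typing import List, Optional, Sequence


def pmm_impute(
    pred_obs: Sequence[float],
    y_obs: Sequence[float],
    pred_mis: Sequence[float],
    k: int = 5,
    rng: Optional[random.Random] = None,
) -> List[float]:
    """Impute by predictive mean matching.

    pred_obs: model predictions for cases whose outcome is observed
    y_obs:    the observed outcomes of those same cases
    pred_mis: model predictions for cases whose outcome is missing
    k:        number of nearest donors to sample from
    """
    if len(pred_obs) != len(y_obs) or not y_obs:
        raise ValueError("need matching, non-empty pred_obs and y_obs")
    if k < 1:
        raise ValueError("k must be at least 1")
    rng = rng or random.Random()
    k = min(k, len(y_obs))

    imputed = []
    for p in pred_mis:
        # Rank observed cases by how close their prediction is to p.
        ranked = sorted(range(len(pred_obs)), key=lambda i: abs(pred_obs[i] - p))
        donor = rng.choice(ranked[:k])
        # Use the donor's observed value, never the raw prediction.
        imputed.append(y_obs[donor])
    return imputed


# Toy example with hypothetical predictions and outcomes.
values = pmm_impute(
    pred_obs=[1.0, 2.0, 3.0, 10.0],
    y_obs=[1.1, 1.9, 3.2, 9.7],
    pred_mis=[2.1, 9.0],
    k=3,
    rng=random.Random(0),
)
```

Because every imputed value is borrowed from an observed case, PMM cannot produce values outside the observed range, which is part of why it is attractive for bounded or skewed survey variables.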

coefficient, missingness, missranger

2401.09602

Country:

Europe > Austria > Vienna (0.14)
North America > United States > New York (0.04)
Europe > Netherlands (0.04)

Genre: Research Report > New Finding (1.00)

Industry:

Health & Medicine (0.68)
Education > Educational Setting > K-12 Education (0.68)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Ensemble Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Decision Tree Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Regression (0.46)

A method for comparing multiple imputation techniques: a case study on the U.S. National COVID Cohort Collaborative

Casiraghi, Elena, Wong, Rachel, Hall, Margaret, Coleman, Ben, Notaro, Marco, Evans, Michael D., Tronieri, Jena S., Blau, Hannah, Laraway, Bryan, Callahan, Tiffany J., Chan, Lauren E., Bramante, Carolyn T., Buse, John B., Moffitt, Richard A., Sturmer, Til, Johnson, Steven G., Shao, Yu Raymond, Reese, Justin, Robinson, Peter N., Paccanaro, Alberto, Valentini, Giorgio, Huling, Jared D., Wilkins, Kenneth, Bennet, Tell, Chute, Christopher, DeWitt, Peter, Gersing, Kenneth, Girvin, Andrew, Haendel, Melissa, Harper, Jeremy, Hajagos, Janos, Hong, Stephanie, Pfaff, Emily, Reusch, Jane, Antoniescu, Corneliu, Robaski, Kimberly

arXiv.org Artificial Intelligence, Sep-25-2022

Healthcare datasets obtained from Electronic Health Records have proven to be extremely useful to assess associations between patients' predictors and outcomes of interest. However, these datasets often suffer from missing values in a high proportion of cases and the simple removal of these cases may introduce severe bias. For these reasons, several multiple imputation algorithms have been proposed to attempt to recover the missing information. Each algorithm presents strengths and weaknesses, and there is currently no consensus on which multiple imputation algorithm works best in a given scenario. Furthermore, the selection of each algorithm's parameters and data-related modelling choices are also both crucial and challenging. In this paper, we propose a novel framework to numerically evaluate strategies for handling missing data in the context of statistical analysis, with a particular focus on multiple imputation techniques. We demonstrate the feasibility of our approach on a large cohort of type-2 diabetes patients provided by the National COVID Cohort Collaborative (N3C) Enclave, where we explored the influence of various patient characteristics on outcomes related to COVID-19. Our analysis included classic multiple imputation techniques as well as simple complete-case Inverse Probability Weighted models. The experiments presented here show that our approach could effectively highlight the most valid and performant missing-data handling strategy for our case study. Moreover, our methodology allowed us to gain an understanding of the behavior of the different models and of how it changed as we modified their parameters. Our method is general and can be applied to different research fields and on datasets containing heterogeneous types.
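
The abstract above contrasts multiple imputation with complete-case Inverse Probability Weighting (IPW). The sketch below shows the basic idea on a toy mean estimate: each complete case is weighted by the inverse of its estimated probability of being complete, so that groups with many dropped cases are not under-represented. As a loudly stated assumption, that probability is estimated here simply as the observed share within strata of a fully observed categorical covariate; the function name `ipw_complete_case_mean` and the data are hypothetical and do not reproduce the models used in the paper.

```python
from collections import defaultdict
from typing import Hashable, Optional, Sequence


def ipw_complete_case_mean(
    strata: Sequence[Hashable], y: Sequence[Optional[float]]
) -> float:
    """Inverse-probability-weighted mean of y over complete cases only.

    strata: a fully observed grouping variable for every case
    y:      the outcome, with None marking a missing value
    """
    if len(strata) != len(y):
        raise ValueError("strata and y must have the same length")

    n_total = defaultdict(int)
    n_observed = defaultdict(int)
    for s, value in zip(strata, y):
        n_total[s] += 1
        if value is not None:
            n_observed[s] += 1

    numerator = 0.0
    denominator = 0.0
    for s, value in zip(strata, y):
        if value is None:
            continue  # incomplete cases are dropped, not imputed
        # weight = 1 / P(complete | stratum), estimated by the observed share
        weight = n_total[s] / n_observed[s]
        numerator += weight * value
        denominator += weight

    if denominator == 0.0:
        raise ValueError("no complete cases available")
    return numerator / denominator


# Toy data: half of stratum "a" is missing, stratum "b" is fully observed.
groups = ["a"] * 4 + ["b"] * 4
outcome = [10.0, 10.0, None, None, 20.0, 20.0, 20.0, 20.0]
weighted_mean = ipw_complete_case_mean(groups, outcome)
```

In this toy data the unweighted complete-case mean is pulled toward stratum "b" because stratum "a" lost half its cases; the weights restore each stratum to its original share. A stratum with no complete cases at all contributes nothing, which is a known limitation of this kind of weighting.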

data mining, data quality, machine learning

2206.06444

Country:

North America > United States > Minnesota > Hennepin County > Minneapolis (0.28)
North America > United States > Pennsylvania > Philadelphia County > Philadelphia (0.14)
North America > United States > Colorado > Denver County > Denver (0.14)

Genre:

Research Report > Strength High (1.00)
Research Report > New Finding (1.00)
Research Report > Experimental Study (1.00)

Industry:

Health & Medicine > Therapeutic Area > Oncology (1.00)
Health & Medicine > Therapeutic Area > Infections and Infectious Diseases (1.00)
Health & Medicine > Therapeutic Area > Immunology (1.00)
Health & Medicine > Therapeutic Area > Endocrinology > Diabetes (1.00)

Technology:

Information Technology > Data Science > Data Mining (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
Information Technology > Software (0.92)
