Ensemble Learning
Interpretable Machine Learning for Cognitive Aging: Handling Missing Data and Uncovering Social Determinant
Mao, Xi, Wang, Zhendong, Li, Jingyu, Mao, Lingchao, Essien, Utibe, Wang, Hairong, Ni, Xuelei Sherry
Early detection of Alzheimer's disease (AD) is crucial because its neurodegenerative effects are irreversible, and neuropathologic and social-behavioral risk factors accumulate years before diagnosis. Identifying higher-risk individuals earlier enables prevention, timely care, and equitable resource allocation. We predict cognitive performance from social determinants of health (SDOH) using the NIH NIA-supported PREPARE Challenge Phase 2 dataset derived from the nationally representative Mex-Cog cohort of the 2003 and 2012 Mexican Health and Aging Study (MHAS). Data: The target is a validated composite cognitive score across seven domains-orientation, memory, attention, language, constructional praxis, and executive function-derived from the 2016 and 2021 MHAS waves. Predictors span demographic, socioeconomic, health, lifestyle, psychosocial, and healthcare access factors. Methodology: Missingness was addressed with a singular value decomposition (SVD)-based imputation pipeline treating continuous and categorical variables separately. This approach leverages latent feature correlations to recover missing values while balancing reliability and scalability. After evaluating multiple methods, XGBoost was chosen for its superior predictive performance. Results and Discussion: The framework outperformed existing methods and the data challenge leaderboard, demonstrating high accuracy, robustness, and interpretability. SHAP-based post hoc analysis identified top contributing SDOH factors and age-specific feature patterns. Notably, flooring material emerged as a strong predictor, reflecting socioeconomic and environmental disparities. Other influential factors, age, SES, lifestyle, social interaction, sleep, stress, and BMI, underscore the multifactorial nature of cognitive aging and the value of interpretable, data-driven SDOH modeling.
Transfer Learning with Distance Covariance for Random Forest: Error Bounds and an EHR Application
Random forest is an important method for ML applications due to its broad outperformance over competing methods for structured tabular data. We propose a method for transfer learning in nonparametric regression using a centered random forest (CRF) with distance covariance-based feature weights, assuming the unknown source and target regression functions are different for a few features (sparsely different). Our method first obtains residuals from predicting the response in the target domain using a source domain-trained CRF. Then, we fit another CRF to the residuals, but with feature splitting probabilities proportional to the sample distance covariance between the features and the residuals in an independent sample. We derive an upper bound on the mean square error rate of the procedure as a function of sample sizes and difference dimension, theoretically demonstrating transfer learning benefits in random forests. In simulations, we show that the results obtained for the CRFs also hold numerically for the standard random forest (SRF) method with data-driven feature split selection. Beyond transfer learning, our results also show the benefit of distance-covariance-based weights on the performance of RF in some situations. Our method shows significant gains in predicting the mortality of ICU patients in smaller-bed target hospitals using a large multi-hospital dataset of electronic health records for 200,000 ICU patients.
Performance Analysis of Machine Learning Algorithms in Chronic Kidney Disease Prediction
Ahmed, Iftekhar, Chowdhury, Tanzil Ebad, Routh, Biggo Bushon, Tasmiya, Nafisa, Sakib, Shadman, Chowdhury, Adil Ahmed
Kidneys are the filter of the human body. About 10% of the global population is thought to be affected by Chronic Kidney Disease (CKD), which causes kidney function to decline. To protect in danger patients from additional kidney damage, effective risk evaluation of CKD and appropriate CKD monitoring are crucial. Due to quick and precise detection capabilities, Machine Learning models can help practitioners accomplish this goal efficiently; therefore, an enormous number of diagnosis systems and processes in the healthcare sector nowadays are relying on machine learning due to its disease prediction capability. In this study, we designed and suggested disease predictive computer-aided designs for the diagnosis of CKD. The dataset for CKD is attained from the repository of machine learning of UCL, with a few missing values; those are filled in using "mean-mode" and "Random sampling method" strategies. After successfully achieving the missing data, eight ML techniques (Random Forest, SVM, Naive Bayes, Logistic Regression, KNN, XGBoost, Decision Tree, and AdaBoost) were used to establish models, and the performance evaluation comparisons among the result accuracies are measured by the techniques to find the machine learning models with the highest accuracy. Among them, Random Forest as well as Logistic Regression showed an outstanding 99% accuracy, followed by the Ada Boost, XGBoost, Naive Bayes, Decision Tree, and SVM, whereas the KNN classifier model stands last with an accuracy of 73%.
Treeffuser: Probabilistic Predictions via Conditional Diffusions with Gradient-Boosted Trees Nicolas Beltran-V elez
Probabilistic prediction aims to compute predictive distributions rather than single point predictions. These distributions enable practitioners to quantify uncertainty, compute risk, and detect outliers. However, most probabilistic methods assume parametric responses, such as Gaussian or Poisson distributions.