CopulaSMOTE: A Copula-Based Oversampling Approach for Imbalanced Classification in Diabetes Prediction
Aich, Agnideep; Murshed, Md Monzur; Hewage, Sameera; Mayeaux, Amanda
Diabetes mellitus poses a significant health risk, as nearly 1 in 9 people are affected by it. Early detection can significantly lower this risk. Despite significant advances in machine learning for identifying diabetic cases, results can still be skewed by class imbalance in the data. To address this challenge, our study uses copula-based data augmentation, which preserves the dependency structure among features when generating data for the minority class, and integrates it with machine learning (ML) techniques. We selected the Pima Indian dataset, generated minority-class data using the A2 copula, and then applied four machine learning algorithms: logistic regression, random forest, gradient boosting, and extreme gradient boosting (XGBoost). Our findings indicate that XGBoost combined with A2 copula oversampling achieved the best performance, improving accuracy by 4.6%, precision by 15.6%, recall by 20.4%, F1-score by 18.2%, and AUC by 25.5% compared to the standard SMOTE method. Furthermore, we statistically validated our results using the McNemar test. This research represents the first known use of A2 copulas for data augmentation and serves as an alternative to the SMOTE technique, highlighting the efficacy of copulas as a statistical method in machine learning applications.
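The general idea of copula-based oversampling can be illustrated with a minimal sketch. The abstract does not give the A2 copula's form or fitting procedure, so the example below swaps in a Gaussian copula purely for illustration; it is not the authors' method. The function name `gaussian_copula_oversample` and its parameters are hypothetical. The sketch maps minority-class features to pseudo-observations through their ranks, estimates the dependence structure, samples new points from the fitted copula, and maps them back through each feature's empirical quantiles.

```python
import numpy as np
from scipy import stats


def gaussian_copula_oversample(X_min, n_new, seed=0):
    """Generate n_new synthetic minority-class rows from X_min.

    Illustrative only: uses a Gaussian copula as a stand-in,
    NOT the A2 copula described in the abstract.
    """
    rng = np.random.default_rng(seed)
    n, d = X_min.shape

    # 1. Pseudo-observations: per-feature ranks scaled into (0, 1).
    ranks = np.apply_along_axis(stats.rankdata, 0, X_min)
    u = ranks / (n + 1)

    # 2. Normal scores and their correlation matrix capture the
    #    dependence structure separately from the marginals.
    z = stats.norm.ppf(u)
    corr = np.corrcoef(z, rowvar=False)

    # 3. Sample from the fitted Gaussian copula.
    z_new = rng.multivariate_normal(np.zeros(d), corr, size=n_new)
    u_new = stats.norm.cdf(z_new)

    # 4. Map back to the data scale with empirical quantiles,
    #    so each synthetic feature stays within the observed range.
    return np.column_stack(
        [np.quantile(X_min[:, j], u_new[:, j]) for j in range(d)]
    )
```

The synthetic rows would then be appended to the training set with the minority label before fitting a classifier. Unlike SMOTE, which interpolates between neighboring minority points, this approach draws from a joint distribution estimated over all minority samples.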
Jun-24-2025
- Country:
- North America
- Canada > Alberta
- United States
- Indiana (0.04)
- Louisiana > Lafayette Parish
- Lafayette (0.04)
- Michigan (0.04)
- Minnesota > Blue Earth County
- Mankato (0.04)
- New York (0.04)
- West Virginia (0.04)
- Genre:
- Research Report
- Experimental Study (1.00)
- New Finding (1.00)
- Industry:
- Health & Medicine > Therapeutic Area > Endocrinology > Diabetes (1.00)
- Technology:
- Information Technology > Artificial Intelligence > Machine Learning
- Ensemble Learning (1.00)
- Neural Networks (0.93)
- Performance Analysis > Accuracy (1.00)
- Statistical Learning > Regression (1.00)