Restoring balance: principled under/oversampling of data for optimal classification

Loffredo, Emanuele, Pastore, Mauro, Cocco, Simona, Monasson, Rémi

May-15-2024–arXiv.org Artificial Intelligence

Class imbalance in real-world data is a common bottleneck for machine learning tasks, since achieving good generalization on under-represented examples is often challenging. Mitigation strategies, such as under- or oversampling the data depending on class abundances, are routinely proposed and tested empirically, but how they should adapt to the data statistics remains poorly understood. In this work, we determine exact analytical expressions of the generalization curves in the high-dimensional regime for linear classifiers (Support Vector Machines). We also provide a sharp prediction of the effects of under/oversampling strategies depending on class imbalance, the first and second moments of the data, and the performance metrics considered. We show that mixed strategies involving both under- and oversampling of data improve performance. Through numerical experiments, we show the relevance of our theoretical predictions on real datasets, on deeper architectures, and with sampling strategies based on unsupervised probabilistic models.
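To make the idea of a mixed strategy concrete, the following is a minimal sketch of plain random resampling in which every class is brought to a common size: larger classes are undersampled and smaller ones are oversampled by duplication. It is not the paper's analytical prescription; the function name `resample_mixed`, the `target_per_class` parameter, and the toy data are assumptions introduced here for illustration only.

```python
import random
from collections import Counter


def resample_mixed(X, y, target_per_class, seed=0):
    """Randomly under- or oversample each class to `target_per_class` examples.

    Hypothetical helper: classes larger than the target are subsampled
    without replacement; smaller classes are padded with random duplicates.
    """
    rng = random.Random(seed)

    # Group example indices by class label.
    by_class = {}
    for i, label in enumerate(y):
        by_class.setdefault(label, []).append(i)

    chosen = []
    for label, idx in by_class.items():
        if len(idx) >= target_per_class:
            # Undersampling: keep a random subset of the class.
            chosen.extend(rng.sample(idx, target_per_class))
        else:
            # Oversampling: keep every example and add random duplicates.
            extra = rng.choices(idx, k=target_per_class - len(idx))
            chosen.extend(idx + extra)

    rng.shuffle(chosen)
    return [X[i] for i in chosen], [y[i] for i in chosen]


# Toy imbalanced dataset (assumed for illustration): 90 negatives, 10 positives.
X = [[float(i)] for i in range(100)]
y = [-1] * 90 + [+1] * 10

# A target between the two class sizes mixes both strategies:
# the majority class is cut from 90 to 30, the minority grown from 10 to 30.
X_bal, y_bal = resample_mixed(X, y, target_per_class=30)
counts = Counter(y_bal)
```

The resampled set could then be passed to any linear classifier. How the target size should be chosen as a function of the data statistics is exactly the question the abstract says the paper addresses analytically; this sketch leaves it as a free parameter.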

classification, dataset, imbalanced data

Country:
- Europe
  - United Kingdom > England
    - Cambridgeshire > Cambridge (0.04)
  - France > Île-de-France
    - Paris > Paris (0.04)
- Asia > Japan
  - Honshū > Kantō > Tokyo Metropolis Prefecture > Tokyo (0.04)

Genre:
- Research Report (0.64)

Industry:
- Health & Medicine > Pharmaceuticals & Biotechnology (0.67)

Technology:
- Information Technology > Artificial Intelligence > Machine Learning
  - Performance Analysis > Accuracy (1.00)
  - Statistical Learning > Support Vector Machines (0.68)
  - Neural Networks > Deep Learning (0.68)
