Impact of Leakage on Data Harmonization in Machine Learning Pipelines in Class Imbalance Across Sites

Nieto, Nicolás, Eickhoff, Simon B., Jung, Christian, Reuter, Martin, Diers, Kersten, Kelm, Malte, Lichtenberg, Artur, Raimondo, Federico, Patil, Kaustubh R.

Dec-10-2024–arXiv.org Artificial Intelligence

Machine learning (ML) models benefit from large datasets. Collecting data in biomedical domains is costly and challenging; hence, combining datasets has become common practice. However, datasets obtained under different conditions can exhibit undesired site-specific variability. Data harmonization methods aim to remove site-specific variance while retaining biologically relevant information. This study evaluates the effectiveness of widely used ComBat-based methods for harmonizing data in scenarios where the class balance is not equal across sites. We find that these methods struggle with data leakage issues. To overcome this problem, we propose a novel approach, "PrettYharmonize", designed to harmonize data by pretending the target labels. We validate our approach using controlled datasets designed to benchmark the utility of harmonization. Finally, using real-world MRI and clinical data, we compare leakage-prone methods with "PrettYharmonize" and show that it achieves comparable performance while avoiding data leakage, particularly in site-target-dependence scenarios.
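To make the leakage concern concrete, the following is a minimal sketch of a leakage-free harmonization step, under stated assumptions. It is neither ComBat nor PrettYharmonize: it uses a simple per-site mean offset as a stand-in, and the function names `fit_site_offsets` and `apply_site_offsets`, the synthetic data, and the train/test split are all hypothetical. The point it illustrates is only that harmonization parameters are estimated on the training data and then applied unchanged to the test data, without using any test labels. A label-free correction like this one may also remove target-related signal when site and target are dependent, which is the scenario the abstract highlights.

```python
import numpy as np

def fit_site_offsets(X, sites):
    """Estimate each site's mean offset from the grand mean (training data only)."""
    grand_mean = X.mean(axis=0)
    return {s: X[sites == s].mean(axis=0) - grand_mean for s in np.unique(sites)}

def apply_site_offsets(X, sites, offsets):
    """Subtract the previously fitted offset for each sample's site."""
    X_out = X.astype(float).copy()
    for s, off in offsets.items():
        X_out[sites == s] -= off
    return X_out

# Synthetic example (assumption): two sites, site 1 shifted by +2 on every feature.
rng = np.random.default_rng(0)
n = 200
sites = rng.integers(0, 2, size=n)
X = rng.normal(size=(n, 3)) + sites[:, None] * 2.0

# Hypothetical split: first 150 samples for training, the rest for testing.
train = np.arange(n) < 150
test = ~train

# Fit on training data only; the test set never influences the parameters.
offsets = fit_site_offsets(X[train], sites[train])
X_train_h = apply_site_offsets(X[train], sites[train], offsets)
X_test_h = apply_site_offsets(X[test], sites[test], offsets)
```

Fitting the offsets on the full dataset, or conditioning them on test labels, would let test-set information reach the model, which is the kind of leakage the abstract refers to.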

artificial intelligence, deep learning, machine learning

Country:
- Europe > Germany
  - North Rhine-Westphalia > Düsseldorf Region > Düsseldorf (0.14)
- North America > United States
  - California (0.28)

Genre:
- Research Report
  - Experimental Study (1.00)
  - New Finding (1.00)

Industry:
- Health & Medicine
  - Diagnostic Medicine > Imaging (1.00)
  - Health Care Technology (1.00)
  - Pharmaceuticals & Biotechnology (1.00)
  - Therapeutic Area > Neurology
    - Alzheimer's Disease (0.69)

Technology:
- Information Technology > Artificial Intelligence > Machine Learning
  - Neural Networks > Deep Learning (0.46)
  - Statistical Learning (0.46)