A Generalized Theory of Mixup for Structure-Preserving Synthetic Data
Chungpa Lee, Jongho Im, Joseph H. T. Kim
A similar approach, SMOTE (Synthetic Minority Over-sampling Technique) (Chawla et al., 2002; He et al., 2008; Bunkhumpornpat et al., 2012; Douzas et al., 2018), also leverages interpolated synthetic instances to enhance model performance, particularly for imbalanced or long-tail distributions, showcasing the effectiveness of mixup methods.

In this paper we place special focus on data synthesis, an important constituent of data augmentation. While there is extensive research on how synthetic data generated by mixup can enhance model performance (Carratino et al., 2022; Zhang et al., 2021), less attention has been given to understanding the fundamental properties of the synthesized data itself; see Sec. 2.1. In fact, most mixup methods generate linearly interpolated instances by taking a weighted average whose weights are randomly drawn from distributions supported on [0, 1], such as the beta or the uniform distribution. This interpolation, however, reduces the variance of the data, which inevitably distorts the statistical structure of the original dataset, both marginally and jointly. The net effect is a less dispersed dataset that emphasizes representative instances while suppressing the others. In this sense, mixup-based synthetic datasets achieve better performance in training machine learning models by sacrificing non-representative instances, such as tail instances, in the dataset.
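To make the variance-contraction argument concrete, below is a minimal sketch of mixup-style interpolation. It is not the authors' implementation: the NumPy setup, the `mixup` function name, the illustrative Gaussian data, and the choice alpha = 1 are our assumptions, used only to verify empirically that convex combinations with Beta(α, α) weights shrink the per-feature variance.

```python
# Minimal sketch of mixup-style interpolation (illustrative, not the paper's code).
import numpy as np

rng = np.random.default_rng(0)

def mixup(X, n_synthetic, alpha=1.0):
    """Synthesize instances as convex combinations of random pairs,
    with mixing weights drawn from Beta(alpha, alpha) on [0, 1]."""
    i = rng.integers(0, len(X), size=n_synthetic)
    j = rng.integers(0, len(X), size=n_synthetic)
    lam = rng.beta(alpha, alpha, size=(n_synthetic, 1))  # weights in [0, 1]
    return lam * X[i] + (1.0 - lam) * X[j]

X = rng.normal(loc=2.0, scale=1.0, size=(100_000, 2))  # illustrative data
X_syn = mixup(X, n_synthetic=100_000)

# For independently drawn pairs, Var(synthetic) = E[lam^2 + (1-lam)^2] * Var(X);
# with alpha = 1 (uniform weights) this factor is 2/3, so the synthetic
# dataset is visibly less dispersed than the original.
print(X.var(axis=0))      # ~[1.0, 1.0]
print(X_syn.var(axis=0))  # ~[0.67, 0.67]
```

The printed variances illustrate the marginal distortion discussed above: interpolation pulls mass toward the center of the data, which is exactly the structural change the paper sets out to characterize.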