Confronting Discrimination in Classification: Smote Based on Marginalized Minorities in the Kernel Space for Imbalanced Data

Zhong, Lingyun

arXiv.org Artificial Intelligence 

The class imbalance problem is a classic classification problem that arises when the number of negative samples (i.e., the majority class) in a data set is much larger than the number of positive samples (i.e., the minority class)[4]. This type of problem is common in many fields. For example, in financial fraud detection, fraud occurs only occasionally and with small probability, yet it can cause huge economic losses. Accurately identifying positive samples is therefore the key to the class imbalance problem. The first difficulty is the rarity of positive samples, which has two senses[2]: absolute rarity, where the positive samples are too few to be representative and are heavily affected by noise; and relative rarity, where the feature space overlaps seriously, making it hard for the model to separate the two classes accurately. The second difficulty is the potential discrimination against positive samples by current mainstream classifiers. Many current models treat the majority and minority classes equally when evaluating classification accuracy, so model evaluation is naturally biased toward the majority class. The third difficulty is the potential discrimination by the oversampling model against the important samples within the positive class. SMOTE, a classic oversampling method for class imbalance[1], selects data at random when expanding the minority class, which may make the feature space overlap more serious because it ignores the important minority samples. To address these problems, we propose a hierarchical SMOTE Based on Marginalized Minorities (MM-SMOTE).
First, we use a basic SVM classifier to roughly classify the data and take the support vectors in the minority class as the important samples for oversampling. We then assign weights to those support vectors based on their distance to the decision hyperplane and, based on the k-nearest neighbors of each support vector, use adaptive oversampling to generate synthetic samples. Finally, the synthetic samples are used to augment the original kernel function of the basic SVM to form a new classifier.
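
The weighting and oversampling steps described above can be illustrated with a minimal sketch. It is not the authors' implementation, and several details are assumptions, since the text does not specify them: the function name `oversample_from_support_vectors` is hypothetical; the SVM is assumed to be fitted elsewhere, so the minority support-vector indices and their distances to the hyperplane are passed in as inputs; the weight is taken as inversely proportional to the distance; "adaptive" is modeled by drawing support vectors with probability equal to their weight; and the synthetic points are generated in input space by SMOTE-style interpolation. The kernel augmentation step is not covered.

```python
import math
import random


def oversample_from_support_vectors(minority, sv_indices, sv_distances,
                                    n_synthetic, k=3, seed=0):
    """Generate synthetic minority points around weighted support vectors.

    minority     : list of minority-class points (each a list of floats)
    sv_indices   : indices into `minority` of the minority support vectors
    sv_distances : distance of each support vector to the decision hyperplane
    n_synthetic  : total number of synthetic points to generate
    k            : number of nearest minority neighbours to interpolate toward
    """
    rng = random.Random(seed)

    # ASSUMPTION: support vectors closer to the hyperplane get larger weights.
    raw = [1.0 / (abs(d) + 1e-12) for d in sv_distances]
    total = sum(raw)
    weights = [r / total for r in raw]

    synthetic = []
    for _ in range(n_synthetic):
        # ASSUMPTION: adaptive allocation = sample a support vector by weight.
        pos = rng.choices(range(len(sv_indices)), weights=weights)[0]
        i = sv_indices[pos]
        base = minority[i]

        # k nearest minority neighbours of the chosen support vector
        # (Euclidean distance in input space, excluding the point itself).
        others = [j for j in range(len(minority)) if j != i]
        neighbours = sorted(
            others, key=lambda j: math.dist(base, minority[j]))[:k]

        # SMOTE-style interpolation between the support vector and a
        # randomly chosen neighbour.
        neighbour = minority[rng.choice(neighbours)]
        gap = rng.random()
        synthetic.append(
            [b + gap * (n - b) for b, n in zip(base, neighbour)])
    return synthetic


# Toy usage with made-up points and distances (illustrative values only).
minority = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
synthetic = oversample_from_support_vectors(
    minority, sv_indices=[0, 3], sv_distances=[0.2, 0.9],
    n_synthetic=10, k=2)
```

In this toy call, the support vector at distance 0.2 is drawn more often than the one at distance 0.9, so more synthetic points are placed near it. At least two minority points are required so that every support vector has a neighbour.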