autosmote
Deep Learning Meets Oversampling: A Learning Framework to Handle Imbalanced Classification
Kishanthan, Sukumar, Hevapathige, Asela
This disproportion often leads to biased model training, making the classifier inclined towards predicting the majority class in the inference phase [1, 2]. The class imbalance problem cannot be readily overlooked, as many real-world datasets related to critical tasks, such as those used in the medical field for disease identification, the finance sector for fraud detection, and network intrusion datasets used in cyber security, exhibit such asymmetric class distributions [3, 4, 5]. Existing machine learning and deep learning approaches primarily rely on resampling techniques, which adjust a dataset to balance its class distribution, to tackle class imbalance [6, 7]. Among the diverse resampling techniques, oversampling approaches are commonly preferred for addressing class imbalance, mainly due to their inherent ability to equalize the class distribution while preserving data semantics and achieving superior performance. A plethora of oversampling techniques has been proposed in the literature, ranging from traditional approaches [8, 9, 10, 11, 12] to those based on deep learning [13, 14, 15].
Towards Automated Imbalanced Learning with Deep Hierarchical Reinforcement Learning
Zha, Daochen, Lai, Kwei-Herng, Tan, Qiaoyu, Ding, Sirui, Zou, Na, Hu, Xia
Imbalanced learning is a fundamental challenge in data mining, where there is a disproportionate ratio of training samples in each class. Over-sampling is an effective technique to tackle imbalanced learning by generating synthetic samples for the minority class. While numerous over-sampling algorithms have been proposed, they rely heavily on heuristics, which can be sub-optimal, since different datasets and base classifiers may need different sampling strategies, and they cannot directly optimize the performance metric. Motivated by this, we investigate developing a learning-based over-sampling algorithm to optimize the classification performance, which is a challenging task because of the huge and hierarchical decision space. At the high level, we need to decide how many synthetic samples to generate. At the low level, we need to determine where the synthetic samples should be located, which depends on the high-level decision since the optimal locations of the samples may differ for different numbers of samples. To address these challenges, we propose AutoSMOTE, an automated over-sampling algorithm that can jointly optimize different levels of decisions. Motivated by the success of SMOTE (Chawla et al., 2002) and its extensions, we formulate the generation process as a Markov decision process (MDP) consisting of three levels of policies to generate synthetic samples within the SMOTE search space. Then we leverage deep hierarchical reinforcement learning to optimize the performance metric on the validation data. Extensive experiments on six real-world datasets demonstrate that AutoSMOTE significantly outperforms the state-of-the-art resampling algorithms. The code is available at https://github.com/daochenzha/autosmote
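To make the "SMOTE search space" mentioned in the abstract above more concrete, the following is a minimal sketch of plain SMOTE-style interpolation in pure Python. It is not the AutoSMOTE implementation and does not reproduce its three-level policy design, which the abstract does not spell out. The function name `smote_sample` and the parameters `n_synthetic`, `k`, and `seed` are illustrative assumptions. Every choice in the sketch (how many samples to create, which base sample, which neighbour, which interpolation weight) is fixed by the caller or drawn at random; these are the kind of decisions that a learned policy could make instead.

```python
import math
import random


def smote_sample(minority, n_synthetic, k=5, seed=0):
    """Generate n_synthetic points by SMOTE-style interpolation.

    minority:    list of feature vectors (lists of floats) of the minority class.
    n_synthetic: number of synthetic samples to create (hypothetical parameter;
                 this corresponds to the "how many" decision in the abstract).
    k:           number of nearest minority neighbours considered per sample.
    """
    if len(minority) < 2:
        raise ValueError("need at least two minority samples to interpolate")
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)
    synthetic = []
    for _ in range(n_synthetic):
        # Pick a base minority sample uniformly at random.
        i = rng.randrange(len(minority))
        base = minority[i]
        # Rank the other minority samples by Euclidean distance to the base.
        others = [j for j in range(len(minority)) if j != i]
        others.sort(key=lambda j: math.dist(base, minority[j]))
        # Pick one of the k nearest neighbours and an interpolation weight.
        neighbour = minority[rng.choice(others[:k])]
        lam = rng.random()
        # The new point lies on the segment between base and neighbour
        # (this corresponds to the "where" decision in the abstract).
        synthetic.append([b + lam * (n - b) for b, n in zip(base, neighbour)])
    return synthetic


# Toy usage: four minority points on the corners of the unit square.
minority = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
new_points = smote_sample(minority, n_synthetic=6, k=2)
```

Because each synthetic point is a convex combination of two existing minority points, it always stays inside the bounding box of the minority class, which is what keeps the search space bounded.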