Binary classification with strong class imbalance can be found in many real-world classification problems. From trying to predict events such as network intrusion and bank fraud to a patient's medical diagnosis, the goal in these cases is to be able to identify instances of the minority class -- that is, the class that is underrepresented in the dataset. This, of course, presents a big challenge as most predictive models tend to ignore the more critical minority class while deceptively giving high accuracy results by favoring the majority class. Several techniques have been used to get around the problem of class imbalance, including different sampling methods and modeling algorithms. Examples of sampling methods include adding data samples to the minority class by either duplicating the data or generating synthetic minority samples (oversampling), or randomly removing majority class data to produce a more balanced data distribution (undersampling).
Class imbalance classification is a challenging research problem in data mining and machine learning, as most of the real-life datasets are often imbalanced in nature. Existing learning algorithms maximise the classification accuracy by correctly classifying the majority class, but misclassify the minority class. However, the minority class instances are representing the concept with greater interest than the majority class instances in real-life applications. Recently, several techniques based on sampling methods (under-sampling of the majority class and over-sampling the minority class), cost-sensitive learning methods, and ensemble learning have been used in the literature for classifying imbalanced datasets. In this paper, we introduce a new clustering-based under-sampling approach with boosting (AdaBoost) algorithm, called CUSBoost, for effective imbalanced classification. The proposed algorithm provides an alternative to RUSBoost (random under-sampling with AdaBoost) and SMOTEBoost (synthetic minority over-sampling with AdaBoost) algorithms. We evaluated the performance of CUSBoost algorithm with the state-of-the-art methods based on ensemble learning like AdaBoost, RUSBoost, SMOTEBoost on 13 imbalance binary and multi-class datasets with various imbalance ratios. The experimental results show that the CUSBoost is a promising and effective approach for dealing with highly imbalanced datasets.
A learning classifier must outperform a trivial solution, in case of imbalanced data, this condition usually does not hold true. To overcome this problem, we propose a novel data level resampling method - Clustering Based Oversampling for improved learning from class imbalanced datasets. The essential idea behind the proposed method is to use the distance between a minority class sample and its respective cluster centroid to infer the number of new sample points to be generated for that minority class sample. The proposed algorithm has very less dependence on the technique used for finding cluster centroids and does not effect the majority class learning in any way. It also improves learning from imbalanced data by incorporating the distribution structure of minority class samples in generation of new data samples. The newly generated minority class data is handled in a way as to prevent outlier production and overfitting. Implementation analysis on different datasets using deep neural networks as the learning classifier shows the effectiveness of this method as compared to other synthetic data resampling techniques across several evaluation metrics.
We use a geometric digraph family called class cover catch digraphs (CCCDs) to tackle the class imbalance problem in statistical classification. CCCDs provide graph theoretic solutions to the class cover problem and have been employed in classification. We assess the classification performance of CCCD classifiers by extensive Monte Carlo simulations, comparing them with other classifiers commonly used in the literature. In particular, we show that CCCD classifiers perform relatively well when one class is more frequent than the other in a two-class setting, an example of the class imbalance problem. We also point out the relationship between class imbalance and class overlapping problems, and their influence on the performance of CCCD classifiers and other classification methods as well as some state-of-the-art algorithms which are robust to class imbalance by construction. Experiments on both simulated and real data sets indicate that CCCD classifiers are robust to the class imbalance problem. CCCDs substantially undersample from the majority class while preserving the information on the discarded points during the undersampling process. Many state-of-the-art methods, however, keep this information by means of ensemble classifiers, but CCCDs yield only a single classifier with the same property, making it both appealing and fast.
In this study, we systematically investigate the impact of class imbalance on classification performance of convolutional neural networks (CNNs) and compare frequently used methods to address the issue. Class imbalance is a common problem that has been comprehensively studied in classical machine learning, yet very limited systematic research is available in the context of deep learning. In our study, we use three benchmark datasets of increasing complexity, MNIST, CIFAR-10 and ImageNet, to investigate the effects of imbalance on classification and perform an extensive comparison of several methods to address the issue: oversampling, undersampling, two-phase training, and thresholding that compensates for prior class probabilities. Our main evaluation metric is area under the receiver operating characteristic curve (ROC AUC) adjusted to multi-class tasks since overall accuracy metric is associated with notable difficulties in the context of imbalanced data. Based on results from our experiments we conclude that (i) the effect of class imbalance on classification performance is detrimental; (ii) the method of addressing class imbalance that emerged as dominant in almost all analyzed scenarios was oversampling; (iii) oversampling should be applied to the level that totally eliminates the imbalance, whereas undersampling can perform better when the imbalance is only removed to some extent; (iv) as opposed to some classical machine learning models, oversampling does not necessarily cause overfitting of CNNs; (v) thresholding should be applied to compensate for prior class probabilities when overall number of properly classified cases is of interest.