A common mistake made by beginners is to apply machine learning algorithms to a problem without establishing a performance baseline. A performance baseline provides a minimum score above which a model is considered to have skill on the dataset. It also provides a point of relative improvement for all models evaluated on the dataset. A baseline can be established using a naive classifier, such as predicting one class label for all examples in the test dataset. Another common mistake made by beginners is using classification accuracy as a performance metric on problems that have an imbalanced class distribution.
Classification accuracy is the total number of correct predictions divided by the total number of predictions made for a dataset. As a performance measure, accuracy is inappropriate for imbalanced classification problems. The main reason is that the overwhelming number of examples from the majority class (or classes) will overwhelm the number of examples in the minority class, meaning that even unskillful models can achieve accuracy scores of 90 percent, or 99 percent, depending on how severe the class imbalance happens to be. An alternative to using classification accuracy is to use precision and recall metrics. In this tutorial, you will discover how to calculate and develop an intuition for precision and recall for imbalanced classification.
We have several machine learning algorithms at our disposal for model building. Doing data based prediction is now easier like never before. Whether it is a regression or classification problem, one can effortlessly achieve a reasonably high accuracy using a suitable algorithm. But, this is not the case everytime. Classification problems can sometimes get a bit tricky. ML algorithms tend to tremble when faced with imbalanced classification data sets. Moreover, they result in biased predictions and misleading accuracies. But, why does it happen? What factors deteriorate their performance?
Learning from class-imbalanced data continues to be a common and challenging problem in supervised learning as standard classification algorithms are designed to handle balanced class distributions. While different strategies exist to tackle this problem, methods which generate artificial data to achieve a balanced class distribution are more versatile than modifications to the classification algorithm. Such techniques, called oversamplers, modify the training data, allowing any classifier to be used with class-imbalanced datasets. Many algorithms have been proposed for this task, but most are complex and tend to generate unnecessary noise. This work presents a simple and effective oversampling method based on k-means clustering and SMOTE oversampling, which avoids the generation of noise and effectively overcomes imbalances between and within classes. Empirical results of extensive experiments with 71 datasets show that training data oversampled with the proposed method improves classification results. Moreover, k-means SMOTE consistently outperforms other popular oversampling methods. An implementation is made available in the python programming language.
Class imbalance presents a major hurdle in the application of data mining methods. A common practice to deal with it is to create ensembles of classifiers that learn from resampled balanced data. For example, bagged decision trees combined with random undersampling (RUS) or the synthetic minority oversampling technique (SMOTE). However, most of the resampling methods entail asymmetric changes to the examples of different classes, which in turn can introduce its own biases in the model. Furthermore, those methods require a performance measure to be specified a priori before learning. An alternative is to use a so-called threshold-moving method that a posteriori changes the decision threshold of a model to counteract the imbalance, thus has a potential to adapt to the performance measure of interest. Surprisingly, little attention has been paid to the potential of combining bagging ensemble with threshold-moving. In this paper, we present probability thresholding bagging (PT-bagging), a versatile plug-in method that fills this gap. Contrary to usual rebalancing practice, our method preserves the natural class distribution of the data resulting in well calibrated posterior probabilities. We also extend the proposed method to handle multiclass data. The method is validated on binary and multiclass benchmark data sets. We perform analyses that provide insights into the proposed method.