In classification tasks, one may encounter a situation where the target class label is not equally distributed. Such a dataset can be termed Imbalanced data. Imbalance in data can be a blocker to train a data science model. In case of imbalance class problems, the model is trained mainly on the majority class and the model becomes biased towards the majority class prediction. Hence handling of imbalance class is essential before proceeding to the modeling pipeline.
Kulkarni, Ajay, Chong, Deri, Batarseh, Feras A.
Dealing with imbalanced data is a prevalent problem while performing classification on the datasets. Many times, this problem contributes to bias while making decisions or implementing policies. Thus, it is vital to understand the factors which causes imbalance in the data (or class imbalance). Such hidden biases and imbalances can lead to data tyranny, and a major challenge to a data democracy. In this chapter, two essential statistical elements are resolved: the degree of class imbalance and the complexity of the concept, solving such issues helps in building the foundations of a data democracy. Further, statistical measures which are appropriate in these scenarios are discussed and implemented on a real-life dataset (car insurance claims). In the end, popular data-level methods such as Random Oversampling, Random Undersampling, SMOTE, Tomek Link, and others are implemented in Python, and their performance is compared. Keywords - Imbalanced Data, Degree of Class Imbalance, Complexity of the Concept, Statistical Assessment Metrics, Undersampling and Oversampling 1. Motivation & Introduction In the real-world, data are collected from various sources like social networks, websites, logs, and databases. Whilst dealing with data from different sources, it is very crucial to check the quality of the data [1]. Data with questionable quality can introduce different types of biases in various stages of the data science lifecycle. These biases sometime can affect the association between variables, and in many cases could represent the opposite of the actual behavior [2].
Imbalanced datasets are a special case for classification problem where the class distribution is not uniform among the classes. One of the techniques to handle imbalance datasets is data sampling. Synthetic Minority Oversampling Technique (SMOTE) is an oversampling technique that generates synthetic samples from the minority class to match the majority class. It is used to obtain a synthetically class-balanced or nearly class-balanced training set. SMOTE works by selecting examples that are close in the feature space, drawing a line between the examples in the feature space and drawing a new sample at a point along that line.
A number of classification problems need to deal with data imbalance between classes. Often it is desired to have a high recall on the minority class while maintaining a high precision on the majority class. In this paper, we review a number of resampling techniques proposed in literature to handle unbalanced datasets and study their effect on classification performance.