undersampling
Statistical Undersampling with Mutual Information and Support Points
Mak, Alex, Sahoo, Shubham, Pandey, Shivani, Yue, Yidan, Kong, Linglong
Class imbalance and distributional differences in large datasets present significant challenges for classification tasks in machine learning, often leading to biased models and poor predictive performance for minority classes. This work introduces two novel undersampling approaches: mutual information-based stratified simple random sampling and support points optimization. These methods prioritize representative data selection, effectively minimizing information loss. Empirical results across multiple classification tasks demonstrate that our methods outperform traditional undersampling techniques, achieving higher balanced classification accuracy. These findings highlight the potential of combining statistical concepts with machine learning to address class imbalance in practical applications.
The Effect of Balancing Methods on Model Behavior in Imbalanced Classification Problems
Stando, Adrian, Cavus, Mustafa, Biecek, Przemysław
Imbalanced data poses a significant challenge in classification as model performance is affected by insufficient learning from minority classes. Balancing methods are often used to address this problem. However, such techniques can lead to problems such as overfitting or loss of information. This study addresses a more challenging aspect of balancing methods - their impact on model behavior. To capture these changes, Explainable Artificial Intelligence tools are used to compare models trained on datasets before and after balancing. In addition to the variable importance method, this study uses the partial dependence profile and accumulated local effects techniques. Real and simulated datasets are tested, and an open-source Python package, edgaro, is developed to facilitate this analysis. The results show significant changes in model behavior due to balancing methods, which can bias models toward a balanced distribution. These findings confirm that balancing analysis should go beyond model performance comparisons to achieve higher reliability of machine learning models. Therefore, we propose a new method, the performance gain plot, to support an informed data balancing strategy: it helps select a balancing method by weighing the measured change in model behavior against the performance gain.
An Empirical Analysis of the Efficacy of Different Sampling Techniques for Imbalanced Classification
Newaz, Asif, Hassan, Shahriar, Haq, Farhan Shahriyar
Learning from imbalanced data is a challenging task. Standard classification algorithms tend to perform poorly when trained on imbalanced data. Special strategies need to be adopted, either by modifying the data distribution or by redesigning the underlying classification algorithm, to achieve desirable performance. The prevalence of imbalance in real-world datasets has led to the creation of a multitude of strategies for the class imbalance issue. However, not all of these strategies are useful or perform well across different imbalance scenarios. There are numerous approaches to dealing with imbalanced data, but their efficacy has not been compared experimentally. In this study, we present a comprehensive analysis of 26 popular sampling techniques to understand their effectiveness in dealing with imbalanced data. Rigorous experiments have been conducted on 50 datasets with different degrees of imbalance to thoroughly investigate the performance of these techniques. A detailed discussion of the advantages and limitations of the techniques, as well as how to overcome such limitations, is presented. We identify some critical factors that affect the sampling strategies and provide recommendations on how to choose an appropriate sampling technique for a particular application.
Handling Class Imbalance Problem
What is the Class Imbalance Problem? Data are said to suffer from the Class Imbalance Problem when the class distributions are highly imbalanced. This is a scenario where the number of observations belonging to one class is significantly lower than the number belonging to the other classes. Machine Learning algorithms tend to produce unacceptable predictions when faced with imbalanced datasets. In this article, we will see some techniques for handling the Class Imbalance Problem using R. Let's take data where the dependent variable is admission into college, based on independent variables such as GRE score, GPA score, and the ranking of the school.
Detect Toxic Content to Improve Online Conversations
Mediratta, Deepshi, Oswal, Nikhil
Social media is filled with toxic content. The aim of this paper is to build a model that can detect insincere questions. We use the 'Quora Insincere Questions Classification' dataset for our analysis. The dataset is composed of sincere and insincere questions, with sincere questions in the majority. The dataset is processed and analyzed using Python and its libraries such as sklearn, numpy, pandas, and keras. The dataset is converted to vector form using word embeddings such as GloVe, Wiki-news and TF-IDF. The imbalance in the dataset is handled by resampling techniques. We train and compare various machine learning and deep learning models to come up with the best results. Models discussed include SVM, Naive Bayes, GRU and LSTM.
Dealing with imbalanced data: undersampling, oversampling and proper cross-validation
Inside the cross-validation loop, hold one sample out and do not use it for anything related to feature selection, oversampling or model building. Oversample your minority class, without the sample you already excluded. Use the excluded sample for validation, and the oversampled minority class together with the majority class to create the model. Repeat n times, where n is your number of samples (if doing leave-one-participant-out cross-validation).
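The procedure above can be sketched in a few lines of plain Python. This is only an illustrative sketch under stated assumptions, not the code from the original article: the toy data, the nearest-centroid classifier, the use of simple random duplication as the oversampler, and all function names (`oversample_minority`, `fit_centroids`, `predict`, `loo_cv_with_oversampling`) are hypothetical choices made for this example. The point it shows is that oversampling happens inside the loop, after the validation sample has been set aside.

```python
import random


def oversample_minority(X, y, rng):
    """Randomly duplicate rows of smaller classes until all classes match the largest."""
    by_class = {}
    for xi, yi in zip(X, y):
        by_class.setdefault(yi, []).append(xi)
    target = max(len(rows) for rows in by_class.values())
    X_bal, y_bal = [], []
    for label, rows in by_class.items():
        extra = [rng.choice(rows) for _ in range(target - len(rows))]
        for row in rows + extra:
            X_bal.append(row)
            y_bal.append(label)
    return X_bal, y_bal


def fit_centroids(X, y):
    """Stand-in model (assumption): one mean vector per class."""
    sums, counts = {}, {}
    for xi, yi in zip(X, y):
        acc = sums.setdefault(yi, [0.0] * len(xi))
        for j, value in enumerate(xi):
            acc[j] += value
        counts[yi] = counts.get(yi, 0) + 1
    return {label: [s / counts[label] for s in acc] for label, acc in sums.items()}


def predict(centroids, x):
    """Return the label of the closest class centroid (squared Euclidean distance)."""
    def sq_dist(c):
        return sum((a - b) ** 2 for a, b in zip(x, c))
    return min(centroids, key=lambda label: sq_dist(centroids[label]))


def loo_cv_with_oversampling(X, y, seed=0):
    """Leave-one-out CV where oversampling only ever sees the training part."""
    rng = random.Random(seed)
    predictions = []
    for i in range(len(X)):
        # 1. Hold sample i out before any resampling or model building.
        X_train = X[:i] + X[i + 1:]
        y_train = y[:i] + y[i + 1:]
        # 2. Oversample the minority class using the remaining samples only.
        X_bal, y_bal = oversample_minority(X_train, y_train, rng)
        # 3. Fit on the balanced training set, validate on the held-out sample.
        model = fit_centroids(X_bal, y_bal)
        predictions.append(predict(model, X[i]))
    return predictions


# Toy data (assumption): six majority points near the origin, two minority points near (5, 5).
X = [[0.0, 0.1], [0.2, -0.1], [-0.1, 0.0], [0.1, 0.2], [0.0, -0.2], [0.3, 0.1],
     [5.0, 5.1], [4.9, 5.2]]
y = [0, 0, 0, 0, 0, 0, 1, 1]

predictions = loo_cv_with_oversampling(X, y)
balanced_accuracy = sum(
    sum(p == t for p, t in zip(predictions, y) if t == c) / y.count(c)
    for c in set(y)
) / len(set(y))
print(predictions, balanced_accuracy)
```

If the oversampling step were moved before the loop, duplicates of the held-out sample could end up in the training set, and the validation estimate would be optimistically biased. Keeping the resampling inside the loop avoids that leak.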
Class Imbalance in Credit Card Fraud Detection - Part 3 : Undersampling in Spark
Whenever we do classification in ML, we often assume that the target label is evenly distributed in our dataset. This helps the training algorithm learn the features, as we have enough examples for all the different cases. For example, in learning a spam filter, we should have a good amount of data for both spam and non-spam emails. This even distribution is not always possible. Let's take an example of fraud detection.