AITopics | undersampling

undersampling

Statistical Undersampling with Mutual Information and Support Points

Mak, Alex, Sahoo, Shubham, Pandey, Shivani, Yue, Yidan, Kong, Linglong

arXiv.org Machine Learning | Dec-18-2024

Class imbalance and distributional differences in large datasets present significant challenges for classification tasks in machine learning, often leading to biased models and poor predictive performance for minority classes. This work introduces two novel undersampling approaches: mutual information-based stratified simple random sampling and support points optimization. These methods prioritize representative data selection, effectively minimizing information loss. Empirical results across multiple classification tasks demonstrate that our methods outperform traditional undersampling techniques, achieving higher balanced classification accuracy. These findings highlight the potential of combining statistical concepts with machine learning to address class imbalance in practical applications.
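The abstract reports results as balanced classification accuracy. As a minimal sketch, assuming the usual definition (the mean of per-class recall, which the abstract itself does not spell out), the metric can be computed as follows. The function name `balanced_accuracy` is illustrative and is not taken from the paper.

```python
import numpy as np

def balanced_accuracy(y_true, y_pred):
    """Mean of per-class recall (assumed definition, not quoted from the paper)."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    # Recall for class c: fraction of true-c samples that were predicted as c.
    recalls = [np.mean(y_pred[y_true == c] == c) for c in np.unique(y_true)]
    return float(np.mean(recalls))

# A classifier that always predicts the majority class looks good on plain
# accuracy but not on balanced accuracy.
y_true = np.array([0] * 9 + [1])
y_pred = np.zeros(10, dtype=int)
plain = float(np.mean(y_true == y_pred))
balanced = balanced_accuracy(y_true, y_pred)
```

Unlike plain accuracy, this metric weights every class equally, so ignoring the minority class is penalized.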

artificial intelligence, dataset, machine learning, (14 more...)

arXiv.org Machine Learning

2412.14527

Country:

North America > United States (0.04)
North America > Canada > Alberta > Census Division No. 11 > Edmonton Metropolitan Region > Edmonton (0.04)

Genre: Research Report > New Finding (1.00)

Industry: Health & Medicine (0.47)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (1.00)

The Effect of Balancing Methods on Model Behavior in Imbalanced Classification Problems

Stando, Adrian, Cavus, Mustafa, Biecek, Przemysław

arXiv.org Artificial Intelligence | Jun-30-2023

Imbalanced data poses a significant challenge in classification as model performance is affected by insufficient learning from minority classes. Balancing methods are often used to address this problem. However, such techniques can lead to problems such as overfitting or loss of information. This study addresses a more challenging aspect of balancing methods: their impact on model behavior. To capture these changes, Explainable Artificial Intelligence tools are used to compare models trained on datasets before and after balancing. In addition to the variable importance method, this study uses the partial dependence profile and accumulated local effects techniques. Real and simulated datasets are tested, and an open-source Python package, edgaro, is developed to facilitate this analysis. The results obtained show significant changes in model behavior due to balancing methods, which can lead to models biased toward a balanced distribution. These findings confirm that balancing analysis should go beyond model performance comparisons to achieve higher reliability of machine learning models. Therefore, we propose a new method, the performance gain plot, for an informed data balancing strategy: it supports an optimal selection of balancing method by analyzing the measure of change in model behavior versus performance gain.

artificial intelligence, machine learning, natural language, (21 more...)

arXiv.org Artificial Intelligence

2307.00157

Country:

Europe > Poland > Masovia Province > Warsaw (0.05)
Asia > Middle East > Republic of Türkiye > Eskisehir Province > Eskisehir (0.05)
South America > Chile > Santiago Metropolitan Region > Santiago Province > Santiago (0.04)
(2 more...)

Genre: Research Report > New Finding (1.00)

Industry: Health & Medicine (0.46)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Expert Systems (0.46)
Information Technology > Artificial Intelligence > Natural Language > Explanation & Argumentation (0.34)
Information Technology > Artificial Intelligence > Machine Learning > Ensemble Learning (0.34)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (0.34)

An Empirical Analysis of the Efficacy of Different Sampling Techniques for Imbalanced Classification

Newaz, Asif, Hassan, Shahriar, Haq, Farhan Shahriyar

arXiv.org Artificial Intelligence | Aug-24-2022

Learning from imbalanced data is a challenging task. Standard classification algorithms tend to perform poorly when trained on imbalanced data. Some special strategies need to be adopted, either by modifying the data distribution or by redesigning the underlying classification algorithm to achieve desirable performance. The prevalence of imbalance in real-world datasets has led to the creation of a multitude of strategies for the class imbalance issue. However, not all the strategies are useful or provide good performance in different imbalance scenarios. There are numerous approaches to dealing with imbalanced data, but a thorough evaluation of their efficacy, or an experimental comparison among them, has not been conducted. In this study, we present a comprehensive analysis of 26 popular sampling techniques to understand their effectiveness in dealing with imbalanced data. Rigorous experiments have been conducted on 50 datasets with different degrees of imbalance to thoroughly investigate the performance of these techniques. A detailed discussion of the advantages and limitations of the techniques, as well as how to overcome such limitations, has been presented. We identify some critical factors that affect the sampling strategies and provide recommendations on how to choose an appropriate sampling technique for a particular application.

algorithm, classifier, dataset, (15 more...)

arXiv.org Artificial Intelligence

2208.11852

Country:

North America > United States > New York > New York County > New York City (0.04)
Asia > Bangladesh > Dhaka Division > Dhaka District > Dhaka (0.04)
North America > United States > Wisconsin (0.04)
South America > Uruguay > Maldonado > Maldonado (0.04)

Genre: Research Report > New Finding (0.67)

Industry: Health & Medicine > Therapeutic Area (1.00)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Performance Analysis > Accuracy (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Ensemble Learning (1.00)

Handling Class Imbalance Problem

#artificialintelligence | Jul-20-2020, 08:00:45 GMT

What is the Class Imbalance Problem? Data are said to suffer from the Class Imbalance Problem when the class distributions are highly imbalanced. This is a scenario where the number of observations belonging to one class is significantly lower than those belonging to the other classes. Machine Learning algorithms tend to produce unacceptable predictions when faced with imbalanced datasets. In this article, we will see some of the techniques for handling the Class Imbalance Problem using R. Let's take data where the dependent variable is admission into college, based on independent variables such as GRE score, GPA score, and the ranking of the school.

artificial intelligence, class imbalance problem, machine learning, (15 more...)

#artificialintelligence

Genre: Research Report (0.37)

Technology: Information Technology > Artificial Intelligence > Machine Learning (1.00)

Detect Toxic Content to Improve Online Conversations

Mediratta, Deepshi, Oswal, Nikhil

arXiv.org Artificial Intelligence | Oct-28-2019

Social media is filled with toxic content. The aim of this paper is to build a model that can detect insincere questions. We use the 'Quora Insincere Questions Classification' dataset for our analysis. The dataset is composed of sincere and insincere questions, with sincere questions in the majority. The dataset is processed and analyzed using Python and its libraries such as sklearn, numpy, pandas, keras, etc. The dataset is converted to vector form using word embeddings such as GloVe and Wiki-news, and using TF-IDF. The imbalance in the dataset is handled by resampling techniques. We train and compare various machine learning and deep learning models to come up with the best results. Models discussed include SVM, Naive Bayes, GRU and LSTM.

dataset, insincere question, undersampling, (14 more...)

arXiv.org Artificial Intelligence

1911.01217

Country:

North America > Canada > Ontario > National Capital Region > Ottawa (0.04)
North America > United States > Oregon > Multnomah County > Portland (0.04)
Europe > Greece > West Greece > Patra (0.04)
(3 more...)

Genre: Research Report (0.50)

Technology:

Information Technology > Communications (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Performance Analysis > Accuracy (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models > Directed Networks > Bayesian Learning (0.49)

Dealing with imbalanced data: undersampling, oversampling and proper cross-validation

#artificialintelligence | Jan-18-2019, 13:07:24 GMT

Inside the cross-validation loop, hold one sample out and do not use it for anything related to feature selection, oversampling or model building. Oversample your minority class without the sample you already excluded. Use the excluded sample for validation, and use the oversampled minority class together with the majority class to create the model. Repeat n times, where n is your number of samples (if doing leave-one-participant-out cross-validation).
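The procedure above can be sketched in plain NumPy. This is a minimal illustration under stated assumptions, not the article's own code: oversampling is done by random duplication of minority rows, and a nearest-centroid classifier stands in for whatever model is actually used. The function names `oversample_minority`, `nearest_centroid_predict` and `loo_cv_with_oversampling` are hypothetical.

```python
import numpy as np

def oversample_minority(X, y, rng):
    """Duplicate randomly chosen rows until every class matches the largest one."""
    classes, counts = np.unique(y, return_counts=True)
    target = counts.max()
    parts = []
    for c, n in zip(classes, counts):
        idx = np.flatnonzero(y == c)
        if n < target:
            # Sample with replacement from this class only.
            extra = rng.choice(idx, size=target - n, replace=True)
            idx = np.concatenate([idx, extra])
        parts.append(idx)
    keep = np.concatenate(parts)
    return X[keep], y[keep]

def nearest_centroid_predict(X_train, y_train, x):
    """Stand-in model (an assumption): predict the class with the closest mean."""
    classes = np.unique(y_train)
    centroids = np.stack([X_train[y_train == c].mean(axis=0) for c in classes])
    return classes[np.argmin(np.linalg.norm(centroids - x, axis=1))]

def loo_cv_with_oversampling(X, y, seed=0):
    """Leave-one-out CV where oversampling never sees the held-out sample."""
    rng = np.random.default_rng(seed)
    preds = np.empty_like(y)
    for i in range(len(y)):
        train = np.arange(len(y)) != i           # exclude sample i first
        X_bal, y_bal = oversample_minority(X[train], y[train], rng)
        preds[i] = nearest_centroid_predict(X_bal, y_bal, X[i])
    return preds

# Synthetic example: 20 majority samples and 5 minority samples.
data_rng = np.random.default_rng(0)
X = np.vstack([data_rng.normal(0, 1, (20, 2)), data_rng.normal(5, 1, (5, 2))])
y = np.array([0] * 20 + [1] * 5)
preds = loo_cv_with_oversampling(X, y)
```

The key design choice is the order of operations: the held-out sample is removed before oversampling, so no copy of it can leak into the training data.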

artificial intelligence, machine learning, minority class, (18 more...)

#artificialintelligence

Country: Europe > Slovenia (0.14)

Industry: Health & Medicine (0.69)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Performance Analysis > Cross Validation (1.00)

Class Imbalance in Credit Card Fraud Detection - Part 3 : Undersampling in Spark

#artificialintelligence | Jan-8-2018, 06:58:49 GMT

Whenever we do classification in ML, we often assume that the target label is evenly distributed in our dataset. This helps the training algorithm learn the features, as we have enough examples for all the different cases. For example, in learning a spam filter, we should have a good amount of data corresponding to both spam and non-spam emails. This even distribution is not always possible. Let's take an example of fraud detection.
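For a skewed label such as fraud versus non-fraud, the simplest form of undersampling keeps all minority rows and a random subset of the majority rows. The following is a minimal sketch in plain NumPy, not the Spark code from the article; the function name `random_undersample` and the synthetic data are assumptions made for illustration.

```python
import numpy as np

def random_undersample(X, y, seed=0):
    """Keep an equal number of rows per class, set by the smallest class."""
    rng = np.random.default_rng(seed)
    classes, counts = np.unique(y, return_counts=True)
    n_keep = counts.min()
    keep = np.concatenate([
        # Sample without replacement so no row is selected twice.
        rng.choice(np.flatnonzero(y == c), size=n_keep, replace=False)
        for c in classes
    ])
    rng.shuffle(keep)  # avoid returning the rows grouped by class
    return X[keep], y[keep]

# Synthetic example: 990 "normal" rows and 10 "fraud" rows.
X = np.arange(1000).reshape(-1, 1)
y = np.array([0] * 990 + [1] * 10)
X_bal, y_bal = random_undersample(X, y)
```

This discards most of the majority class, which is the information loss that more selective undersampling methods try to reduce.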

artificial intelligence, credit card fraud detection, machine learning, (8 more...)

#artificialintelligence

Industry: Law Enforcement & Public Safety > Fraud (1.00)

Technology: Information Technology > Artificial Intelligence > Machine Learning (1.00)
