AITopics | oversampling

Collaborating Authors

oversampling

Information about AI from the News, Publications, and Conferences

Automatic Classification – Tagging and Summarization – Customizable Filtering and Analysis

If you are looking for an answer to the question What is Artificial Intelligence? and you only have a minute, then here's the definition the Association for the Advancement of Artificial Intelligence offers on its home page: "the scientific understanding of the mechanisms underlying thought and intelligent behavior and their embodiment in machines."

However, if you are fortunate enough to have more than a minute, then please get ready to embark upon an exciting journey exploring AI (but beware, it could last a lifetime) …

Uncertainty-Aware Generative Oversampling Using an Entropy-Guided Conditional Variational Autoencoder

Zare, Amirhossein, Zare, Amirhessam, Pezeshki, Parmida Sadat, Herlock, null, Rahimi, null, Ebrahimi, Ali, Vázquez-García, Ignacio, Celi, Leo Anthony

arXiv.org Artificial IntelligenceOct-7-2025

Class imbalance remains a major challenge in machine learning, especially for high-dimensional biomedical data where nonlinear manifold structures dominate. Traditional oversampling methods such as SMOTE rely on local linear interpolation, often producing implausible synthetic samples. Deep generative models like Conditional Variational Autoencoders (CVAEs) better capture nonlinear distributions, but standard variants treat all minority samples equally, neglecting the importance of uncertain, boundary-region examples emphasized by heuristic methods like Borderline-SMOTE and ADASYN. We propose Local Entropy-Guided Oversampling with a CVAE (LEO-CVAE), a generative oversampling framework that explicitly incorporates local uncertainty into both representation learning and data generation. To quantify uncertainty, we compute Shannon entropy over the class distribution in a sample's neighborhood: high entropy indicates greater class overlap, serving as a proxy for uncertainty. LEO-CVAE leverages this signal through two mechanisms: (i) a Local Entropy-Weighted Loss (LEWL) that emphasizes robust learning in uncertain regions, and (ii) an entropy-guided sampling strategy that concentrates generation in these informative, class-overlapping areas. Applied to clinical genomics datasets (ADNI and TCGA lung cancer), LEO-CVAE consistently improves classifier performance, outperforming both traditional oversampling and generative baselines. These results highlight the value of uncertainty-aware generative oversampling for imbalanced learning in domains governed by complex nonlinear structures, such as omics data.

artificial intelligence, deep learning, machine learning, (17 more...)

arXiv.org Artificial Intelligence

2509.25334

Country:

North America > United States > Massachusetts (0.28)
North America > United States > California (0.28)

Genre:

Research Report > Experimental Study (0.68)
Research Report > New Finding (0.67)

Industry:

Health & Medicine > Therapeutic Area > Oncology (1.00)
Health & Medicine > Pharmaceuticals & Biotechnology (1.00)
Health & Medicine > Therapeutic Area > Neurology > Alzheimer's Disease (0.47)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.48)

Add feedback

Comparative Analysis of Stroke Prediction Models Using Machine Learning

Tashkova, Anastasija, Eftimov, Stefan, Ristov, Bojan, Kalajdziski, Slobodan

arXiv.org Artificial IntelligenceMay-16-2025

This study underscores the potential of machine learning in stroke risk prediction, addressing key challenges such as class imbalance and feature selection. Our findings highlight the significant role of demographic, clinical, and lifestyle factors, with age, average glucose level, and BMI emerging as key predictors. Notably, the analysis reveals the importance of age-specific models, as the predictive influence of factors shifts across different age groups. In elderly patients (65-80 years), work type and glucose level become more influential, while hypertension and heart disease gain prominence. By achieving high predictive accuracy and identifying im-pactful features, this research supports the development of stroke risk assessment tools with potential integration into clinical decision systems. These tools could assist clinicians in early intervention planning and personalized prevention strategies. However, challenges such as data variability, model interpretability, and deployment in real-world healthcare settings remain. Future research should focus on improving model sensitivity, incorporating diverse datasets, and validating predictions in clinical environments. Advancing these models could enhance early detection strategies, ultimately improving patient outcomes and stroke prevention.

artificial intelligence, machine learning, random forest, (14 more...)

arXiv.org Artificial Intelligence

2505.09812

Country: Europe > North Macedonia (0.14)

Genre: Research Report > New Finding (1.00)

Industry: Health & Medicine > Therapeutic Area > Cardiology/Vascular Diseases (1.00)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (0.58)
Information Technology > Artificial Intelligence > Machine Learning > Ensemble Learning (0.41)
Information Technology > Artificial Intelligence > Machine Learning > Performance Analysis (0.35)

Add feedback

Enhancing Synthetic Oversampling for Imbalanced Datasets Using Proxima-Orion Neighbors and q-Gaussian Weighting Technique

Yadav, Pankaj, Vijay, Vivek, Sihag, Gulshan

arXiv.org Machine LearningJan-27-2025

In this article, we propose a novel oversampling algorithm to increase the number of instances of minority class in an imbalanced dataset. We select two instances, Proxima and Orion, from the set of all minority class instances, based on a combination of relative distance weights and density estimation of majority class instances. Furthermore, the q-Gaussian distribution is used as a weighting mechanism to produce new synthetic instances to improve the representation and diversity. We conduct a comprehensive experiment on 42 datasets extracted from KEEL software and eight datasets from the UCI ML repository to evaluate the usefulness of the proposed (PO-QG) algorithm. Wilcoxon signed-rank test is used to compare the proposed algorithm with five other existing algorithms. The test results show that the proposed technique improves the overall classification performance. We also demonstrate the PO-QG algorithm to a dataset of Indian patients with sarcopenia.

artificial intelligence, data mining, machine learning, (18 more...)

arXiv.org Machine Learning

2501.1579

Country:

North America > United States > Wisconsin (0.05)
Asia > India (0.04)
South America > Chile > Santiago Metropolitan Region > Santiago Province > Santiago (0.04)
(3 more...)

Genre:

Research Report > Experimental Study (0.68)
Research Report > New Finding (0.48)

Industry: Health & Medicine > Therapeutic Area (0.46)

Technology:

Information Technology > Data Science > Data Mining (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (1.00)
(2 more...)

Add feedback

Imbalanced Graph Classification with Multi-scale Oversampling Graph Neural Networks

Ma, Rongrong, Pang, Guansong, Chen, Ling

arXiv.org Artificial IntelligenceMay-17-2024

One main challenge in imbalanced graph classification is to learn expressive representations of the graphs in under-represented (minority) classes. Existing generic imbalanced learning methods, such as oversampling and imbalanced learning loss functions, can be adopted for enabling graph representation learning models to cope with this challenge. However, these methods often directly operate on the graph representations, ignoring rich discriminative information within the graphs and their interactions. To tackle this issue, we introduce a novel multi-scale oversampling graph neural network (MOSGNN) that learns expressive minority graph representations based on intra- and inter-graph semantics resulting from oversampled graphs at multiple scales - subgraph, graph, and pairwise graphs. It achieves this by jointly optimizing subgraph-level, graph-level, and pairwise-graph learning tasks to learn the discriminative information embedded within and between the minority graphs. Extensive experiments on 16 imbalanced graph datasets show that MOSGNN i) significantly outperforms five state-of-the-art models, and ii) offers a generic framework, in which different advanced imbalanced learning loss functions can be easily plugged in and obtain significantly improved classification performance.

classification, graph, mosgnn, (15 more...)

arXiv.org Artificial Intelligence

2405.04903

Country:

Asia > Singapore (0.04)
Oceania > Australia > New South Wales > Sydney (0.04)
North America > United States > California (0.04)

Genre: Research Report > Promising Solution (0.34)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)

Add feedback

GANsemble for Small and Imbalanced Data Sets: A Baseline for Synthetic Microplastics Data

Platnick, Daniel, Khanzadeh, Sourena, Sadeghian, Alireza, Valenzano, Richard Anthony

arXiv.org Artificial IntelligenceApr-30-2024

Microplastic particle ingestion or inhalation by humans is a problem of growing concern. Unfortunately, current research methods that use machine learning to understand their potential harms are obstructed by a lack of available data. Deep learning techniques in particular are challenged by such domains where only small or imbalanced data sets are available. Overcoming this challenge often involves oversampling underrepresented classes or augmenting the existing data to improve model performance. This paper proposes GANsemble: a two-module framework connecting data augmentation with conditional generative adversarial networks (cGANs) to generate class-conditioned synthetic data. First, the data chooser module automates augmentation strategy selection by searching for the best data augmentation strategy. Next, the cGAN module uses this strategy to train a cGAN for generating enhanced synthetic data. We experiment with the GANsemble framework on a small and imbalanced microplastics data set. A Microplastic-cGAN (MPcGAN) algorithm is introduced, and baselines for synthetic microplastics (SYMP) data are established in terms of Frechet Inception Distance (FID) and Inception Scores (IS). We also provide a synthetic microplastics filter (SYMP-Filter) algorithm to increase the quality of generated SYMP. Additionally, we show the best amount of oversampling with augmentation to fix class imbalance in small microplastics data sets. To our knowledge, this study is the first application of generative AI to synthetically create microplastics data.

augmentation strategy, data chooser module, microplastic data, (13 more...)

arXiv.org Artificial Intelligence

2404.07356

Country:

North America > Canada > Ontario > Toronto (0.14)
North America > United States > California > San Diego County > San Diego (0.04)
Europe > France (0.04)

Genre:

Research Report (0.64)
Overview (0.48)

Industry: Health & Medicine (0.46)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Add feedback

The Effect of Balancing Methods on Model Behavior in Imbalanced Classification Problems

Stando, Adrian, Cavus, Mustafa, Biecek, Przemysław

arXiv.org Artificial IntelligenceJun-30-2023

Imbalanced data poses a significant challenge in classification as model performance is affected by insufficient learning from minority classes. Balancing methods are often used to address this problem. However, such techniques can lead to problems such as overfitting or loss of information. This study addresses a more challenging aspect of balancing methods - their impact on model behavior. To capture these changes, Explainable Artificial Intelligence tools are used to compare models trained on datasets before and after balancing. In addition to the variable importance method, this study uses the partial dependence profile and accumulated local effects techniques. Real and simulated datasets are tested, and an open-source Python package edgaro is developed to facilitate this analysis. The results obtained show significant changes in model behavior due to balancing methods, which can lead to biased models toward a balanced distribution. These findings confirm that balancing analysis should go beyond model performance comparisons to achieve higher reliability of machine learning models. Therefore, we propose a new method performance gain plot for informed data balancing strategy to make an optimal selection of balancing method by analyzing the measure of change in model behavior versus performance gain.

artificial intelligence, machine learning, natural language, (21 more...)

arXiv.org Artificial Intelligence

2307.00157

Country:

Europe > Poland > Masovia Province > Warsaw (0.05)
Asia > Middle East > Republic of Türkiye > Eskisehir Province > Eskisehir (0.05)
South America > Chile > Santiago Metropolitan Region > Santiago Province > Santiago (0.04)
(2 more...)

Genre: Research Report > New Finding (1.00)

Industry: Health & Medicine (0.46)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Expert Systems (0.46)
Information Technology > Artificial Intelligence > Natural Language > Explanation & Argumentation (0.34)
Information Technology > Artificial Intelligence > Machine Learning > Ensemble Learning (0.34)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (0.34)

Add feedback

Stop Oversampling for Class Imbalance Learning: A Critical Review

Hassanat, Ahmad B., Tarawneh, Ahmad S., Altarawneh, Ghada A.

arXiv.org Artificial IntelligenceFeb-4-2022

For the last two decades, oversampling has been employed to overcome the challenge of learning from imbalanced datasets. Many approaches to solving this challenge have been offered in the literature. Oversampling, on the other hand, is a concern. That is, models trained on fictitious data may fail spectacularly when put to real-world problems. The fundamental difficulty with oversampling approaches is that, given a real-life population, the synthesized samples may not truly belong to the minority class. As a result, training a classifier on these samples while pretending they represent minority may result in incorrect predictions when the model is used in the real world. We analyzed a large number of oversampling methods in this paper and devised a new oversampling evaluation system based on hiding a number of majority examples and comparing them to those generated by the oversampling process. Based on our evaluation system, we ranked all these methods based on their incorrectly generated examples for comparison. Our experiments using more than 70 oversampling methods and three imbalanced real-world datasets reveal that all oversampling methods studied generate minority samples that are most likely to be majority. Given data and methods in hand, we argue that oversampling in its current forms and methodologies is unreliable for learning from class imbalanced data and should be avoided in real-world applications.

dataset, imbalanced data, international conference, (15 more...)

arXiv.org Artificial Intelligence

2202.03579

Country:

Asia > Middle East > Jordan (0.04)
Europe > Hungary > Budapest > Budapest (0.04)
Asia > Middle East > Saudi Arabia (0.04)

Genre: Research Report > New Finding (0.67)

Industry:

Information Technology > Security & Privacy (1.00)
Health & Medicine > Therapeutic Area (1.00)
Transportation > Ground > Road (0.67)
(2 more...)

Technology:

Information Technology > Sensing and Signal Processing (1.00)
Information Technology > Security & Privacy (1.00)
Information Technology > Information Management (1.00)
(8 more...)

Add feedback

Integrating Unsupervised Clustering and Label-specific Oversampling to Tackle Imbalanced Multi-label Data

Sadhukhan, Payel, Pakrashi, Arjun, Palit, Sarbani, Mac Namee, Brian

arXiv.org Artificial IntelligenceSep-25-2021

There is often a mixture of very frequent labels and very infrequent labels in multi-label datatsets. This variation in label frequency, a type class imbalance, creates a significant challenge for building efficient multi-label classification algorithms. In this paper, we tackle this problem by proposing a minority class oversampling scheme, UCLSO, which integrates Unsupervised Clustering and Label-Specific data Oversampling. Clustering is performed to find out the key distinct and locally connected regions of a multi-label dataset (irrespective of the label information). Next, for each label, we explore the distributions of minority points in the cluster sets. Only the minority points within a cluster are used to generate the synthetic minority points that are used for oversampling. Even though the cluster set is the same across all labels, the distributions of the synthetic minority points will vary across the labels. The training dataset is augmented with the set of label-specific synthetic minority points, and classifiers are trained to predict the relevance of each label independently. Experiments using 12 multi-label datasets and several multi-label algorithms show that the proposed method performed very well compared to the other competing algorithms.

artificial intelligence, data mining, machine learning, (19 more...)

arXiv.org Artificial Intelligence

2109.12421

Country:

Asia > India > West Bengal > Kolkata (0.04)
North America > United States > New York > New York County > New York City (0.04)
Europe > Ireland > Leinster > County Dublin > Dublin (0.04)

Genre: Research Report > New Finding (0.46)

Technology:

Information Technology > Data Science > Data Mining (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.71)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Support Vector Machines (0.46)

Add feedback

Overcoming an Imbalanced Dataset using Oversampling.

#artificialintelligenceJul-13-2020, 02:57:34 GMT

How oversampling yielded great results for classifying cases of Sexual Harassment. When it comes to data science, sexual harassment is an imbalanced data problem, meaning there are few (known) instances of harassment in the entire dataset. An imbalanced problem is defined as a dataset which has disproportional class counts. Oversampling is one way to combat this by creating synthetic minority samples. SMOTE -- Synthetic Minority Over-sampling Technique -- is a common oversampling method widely used in machine learning with imbalanced high-dimensional datasets using Oversampling.

artificial intelligence, machine learning, oversampling, (14 more...)

#artificialintelligence

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Nearest Neighbor Methods (0.36)

Add feedback

Oversampling for Imbalanced Time Series Data

Zhu, Tuanfei, Lin, Yaping, Liu, Yonghe

arXiv.org Machine LearningApr-14-2020

Many important real-world applications involve time-series data with skewed distribution. Compared to conventional imbalance learning problems, the classification of imbalanced time-series data is more challenging due to high dimensionality and high inter-variable correlation. This paper proposes a structure preserving Oversampling method to combat the High-dimensional Imbalanced Time-series classification (OHIT). OHIT first leverages a density-ratio based shared nearest neighbor clustering algorithm to capture the modes of minority class in high-dimensional space. It then for each mode applies the shrinkage technique of large-dimensional covariance matrix to obtain accurate and reliable covariance structure. Finally, OHIT generates the structure-preserving synthetic samples based on multivariate Gaussian distribution by using the estimated covariance matrices. Experimental results on several publicly available time-series datasets (including unimodal and multi-modal) demonstrate the superiority of OHIT against the state-of-the-art oversampling algorithms in terms of F-value, G-mean, and AUC.

covariance matrix, minority class, ohit, (16 more...)

arXiv.org Machine Learning

2004.06373

Country:

North America > United States > California > San Francisco County > San Francisco (0.14)
North America > United States > District of Columbia > Washington (0.05)
North America > United States > Texas (0.04)
(3 more...)

Genre: Research Report > New Finding (0.46)

Technology:

Information Technology > Data Science > Data Mining (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.66)

Add feedback