AITopics | smote

smote

Concentration and excess risk bounds for imbalanced classification with synthetic oversampling

Neural Information Processing Systems | Jun-23-2026, 00:42:59 GMT

Synthetic oversampling of minority examples using SMOTE and its variants is a leading strategy for addressing imbalanced classification problems. Despite the success of this approach in practice, its theoretical foundations remain underexplored. We develop a theoretical framework to analyze the behavior of SMOTE and related methods when classifiers are trained on synthetic data. We first derive a uniform concentration bound on the discrepancy between the empirical risk over synthetic minority samples and the population risk on the true minority distribution. We then provide a nonparametric excess risk guarantee for kernel-based classifiers trained using such synthetic data. These results lead to practical guidelines for better parameter tuning of both SMOTE and the downstream learning algorithm. Numerical experiments are provided to illustrate and support the theoretical findings.
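For readers unfamiliar with the mechanism being analyzed, the core SMOTE step can be sketched as follows. This is a minimal illustration, not the authors' code: the function name `smote`, the Euclidean distance, and the toy data are assumptions made here, and production use would normally rely on an established library instead.

```python
import random


def smote(minority, n_synthetic, k=5, seed=0):
    """Minimal SMOTE sketch: interpolate between a minority sample and one
    of its k nearest minority neighbours (Euclidean distance assumed)."""
    if len(minority) < 2:
        raise ValueError("need at least two minority samples")
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)

    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    synthetic = []
    for _ in range(n_synthetic):
        i = rng.randrange(len(minority))
        base = minority[i]
        # k nearest neighbours of `base` among the other minority points
        others = [p for j, p in enumerate(minority) if j != i]
        neighbours = sorted(others, key=lambda p: sq_dist(base, p))[:k]
        neighbour = rng.choice(neighbours)
        u = rng.random()  # interpolation weight in [0, 1)
        synthetic.append([b + u * (n - b) for b, n in zip(base, neighbour)])
    return synthetic


# Toy minority class: the four corners of the unit square (illustrative only).
minority = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
new_points = smote(minority, n_synthetic=6, k=2)
```

Every synthetic point lies on a line segment between two real minority points, so the synthetic sample never leaves the convex hull of the observed minority data. The number of neighbours `k` is one of the SMOTE parameters whose tuning the abstract refers to.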

classifier, machine learning, natural language, (20 more...)

Country:

North America > United States (0.46)
Europe (0.46)

Genre: Research Report > Experimental Study (1.00)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (1.00)
Information Technology > Artificial Intelligence > Natural Language (0.93)

LLMs on Tabular Data with Limited Semantics: Evidence from Industrial Car Retrofit Prediction

Pons, Aina Vila, Tzachristas, Ioannis, Antoniou, Constantinos

arXiv.org Machine Learning | Jun-16-2026

Industrial retrofit planning depends on structured operational data rather than free text: planners must estimate whether a newly registered prototype will require a retrofit, which retrofit package it will need, and how long the work will take. We study an industrial dataset linking a prototype-registration system (284,271 vehicles) with a retrofit-management system (48,716 cleaned visits), and compare strong tabular machine learning baselines with three LLM-based strategies on row-serialized inputs: embedding features (Amazon Titan), direct prompted classification (Claude Sonnet 4), and an ML+LLM stacking approach. Across binary occurrence prediction, 15-way retrofit-type classification, per-visit duration regression, and an aggregated monthly benchmark, classical tree ensembles remain the strongest standalone models. However, the LLM results reveal a consistent pattern: embeddings remain useful on tables (binary AUC = 0.982), direct prompting collapses once semantic signal is stripped by hashing (binary AUC = 0.500; multiclass weighted F1 = 0.018), and hybrid stacking yields the best manually built multiclass model (weighted F1 = 0.626). On the monthly benchmark, lag-based machine learning outperforms time-series foundation models, though Chronos-small remains competitive in zero-shot forecasting. The results suggest that on privacy-constrained industrial tables, LLMs are more effective as complementary components than as replacements for strong tabular baselines.
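To make "row-serialized inputs" and "semantic signal stripped by hashing" concrete, here is one possible sketch. The abstract does not describe the serialization format or the hashing scheme, so everything below is an assumption for illustration: the `column: value` template, the truncated SHA-256 digests, and the column names are all hypothetical.

```python
import hashlib


def short_hash(text, length=8):
    """Deterministic opaque token for a string (truncated SHA-256 hex)."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()[:length]


def serialize_row(row, hashed=False):
    """Turn one table row into a text line an LLM could consume.

    With hashed=True, column names and string values are replaced by
    opaque tokens, so no human-readable meaning remains in the prompt.
    Numeric values are kept as they are.
    """
    parts = []
    for column, value in row.items():
        if hashed:
            column = short_hash(column)
            if isinstance(value, str):
                value = short_hash(value)
        parts.append(f"{column}: {value}")
    return "; ".join(parts)


# Hypothetical row; these fields are not taken from the paper's dataset.
row = {"vehicle_type": "prototype", "plant": "plant_A", "mileage_km": 1200}
plain = serialize_row(row)
masked = serialize_row(row, hashed=True)
```

In the masked form a prompted model sees only opaque tokens. That is consistent with the reported collapse of direct prompting to chance-level AUC, whereas a tabular learner can still treat the tokens as ordinary categorical levels.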

benchmark, large language model, natural language, (20 more...)

2606.15314

Country:

North America > United States (0.16)
Europe > Germany (0.15)

Genre: Research Report (0.64)

Technology: Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)

Classification Imbalance as Transfer Learning

Xia, Eric, Klusowski, Jason M.

arXiv.org Machine Learning | Jan-16-2026

Classification imbalance arises when one class is much rarer than the other. We frame this setting as transfer learning under label (prior) shift between an imbalanced source distribution induced by the observed data and a balanced target distribution under which performance is evaluated. Within this framework, we study a family of oversampling procedures that augment the training data by generating synthetic samples from an estimated minority-class distribution to roughly balance the classes, among which the celebrated SMOTE algorithm is a canonical example. We show that the excess risk decomposes into the rate achievable under balanced training (as if the data had been drawn from the balanced target distribution) and an additional term, the cost of transfer, which quantifies the discrepancy between the estimated and true minority-class distributions. In particular, we show that the cost of transfer for SMOTE dominates that of bootstrapping (random oversampling) in moderately high dimensions, suggesting that we should expect bootstrapping to have better performance than SMOTE in general. We corroborate these findings with experimental evidence. More broadly, our results provide guidance for choosing among augmentation strategies for imbalanced classification.
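The bootstrapping baseline mentioned above (random oversampling) is simple enough to sketch in a few lines. This is an illustrative version, assuming a binary problem and exact 1:1 balancing; the function name and toy data are hypothetical and not taken from the paper.

```python
import random


def bootstrap_oversample(X, y, minority_label=1, seed=0):
    """Random oversampling: resample minority rows with replacement until
    both classes have the same size (binary labels assumed)."""
    rng = random.Random(seed)
    minority = [x for x, label in zip(X, y) if label == minority_label]
    if not minority:
        raise ValueError("no minority samples to resample")
    n_majority = len(y) - len(minority)
    n_extra = max(0, n_majority - len(minority))
    # Each extra row is an exact copy of an observed minority row.
    extra = [rng.choice(minority) for _ in range(n_extra)]
    return X + extra, y + [minority_label] * n_extra


# Toy data: four majority rows and one minority row (illustrative only).
X = [[0.1], [0.2], [0.3], [0.4], [5.0]]
y = [0, 0, 0, 0, 1]
X_bal, y_bal = bootstrap_oversample(X, y)
```

Unlike SMOTE, this procedure never creates a point that was not observed. In the paper's terms, its estimate of the minority distribution is the empirical distribution itself, and that estimate is what the cost-of-transfer term compares against the interpolation-based SMOTE estimate.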

artificial intelligence, machine learning, probability, (16 more...)

2601.1063

Country: North America > United States (0.27)

Genre: Research Report > New Finding (0.48)

Industry: Health & Medicine > Therapeutic Area (0.46)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (0.93)
Information Technology > Artificial Intelligence > Machine Learning > Transfer Learning (0.61)

A Hybrid Deep Learning and Anomaly Detection Framework for Real-Time Malicious URL Classification

Khaled, Berkani, Rafik, Zeraoulia

arXiv.org Artificial Intelligence | Dec-4-2025

The number and sophistication of cyberthreats have increased along with the internet's exponential expansion, especially those spread through malicious URLs. A variety of attacks, such as phishing, drive-by downloads, command-and-control communications, and data exfiltration, are launched using malicious websites. Because attackers constantly change URLs to avoid detection, traditional blacklisting techniques cannot keep up with the dynamic and hostile character of contemporary threats. As a result, intelligent algorithms that can recognize intricate patterns in URLs and instantly identify malicious ones have become crucial components of contemporary cybersecurity defenses [1, 13]. Because machine learning (ML) and deep learning (DL) approaches can identify non-linear relationships in input data and generalize from observed patterns, they have shown considerable promise in malicious URL detection [2, 3]. Several obstacles remain, however: class imbalance (labeled malicious data is scarce compared to benign URLs); adversarial techniques that produce highly obfuscated or anomalous URLs and undermine traditional classifiers; and detection systems that are mostly restricted to monolingual user interfaces and lack real-time usability features.

data mining, detection, machine learning, (15 more...)

2512.03462

Genre: Research Report (0.66)

Industry: Information Technology > Security & Privacy (1.00)

Technology:

Information Technology > Security & Privacy (1.00)
Information Technology > Data Science > Data Mining > Anomaly Detection (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Augmenting The Weather: A Hybrid Counterfactual-SMOTE Algorithm for Improving Crop Growth Prediction When Climate Changes

Temraz, Mohammed, Keane, Mark T

arXiv.org Artificial Intelligence | Nov-18-2025

In recent years, humanity has begun to experience the catastrophic effects of climate change as economic sectors (such as agriculture) struggle with unpredictable and extreme weather events. Artificial Intelligence (AI) should help us handle these climate challenges but its most promising solutions are not good at dealing with climate-disrupted data; specifically, machine learning methods that work from historical data-distributions are not good at handling out-of-distribution, outlier events. In this paper, we propose a novel data augmentation method that treats the predictive problems around climate change as being, in part, due to class-imbalance issues; that is, prediction from historical datasets is difficult because, by definition, they lack sufficient minority-class instances of "climate outlier events". This novel data augmentation method, called Counterfactual-Based SMOTE (CFA-SMOTE), combines an instance-based counterfactual method from Explainable AI (XAI) with the well-known class-imbalance method, SMOTE. CFA-SMOTE creates synthetic data-points representing outlier climate-events that augment the dataset to improve predictive performance. We report comparative experiments using this CFA-SMOTE method, comparing it to benchmark counterfactual and class-imbalance methods under different conditions (i.e., class-imbalance ratios). The focal climate-change domain used relies on predicting grass growth on Irish dairy farms during the Europe-wide drought and forage crisis of 2018.

artificial intelligence, machine learning, smote, (17 more...)

2511.11945

Country:

North America > United States (0.46)
Europe > Ireland (0.28)

Genre:

Research Report > New Finding (0.46)
Research Report > Promising Solution (0.34)

Industry:

Health & Medicine (1.00)
Government (1.00)
Food & Agriculture > Agriculture (0.87)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.93)

Data Leakage and Deceptive Performance: A Critical Examination of Credit Card Fraud Detection Methodologies

Hayat, Khizar, Magnier, Baptiste

arXiv.org Artificial Intelligence | Nov-11-2025

This study critically examines the methodological rigor in credit card fraud detection research, revealing how fundamental evaluation flaws can overshadow algorithmic sophistication. Through deliberate experimentation with improper evaluation protocols, we demonstrate that even simple models can achieve deceptively impressive results when basic methodological principles are violated. Our analysis identifies four critical issues plaguing current approaches: (1) pervasive data leakage from improper preprocessing sequences, (2) intentional vagueness in methodological reporting, (3) inadequate temporal validation for transaction data, and (4) metric manipulation through recall optimization at precision's expense. We present a case study showing how a minimal neural network architecture with data leakage outperforms many sophisticated methods reported in literature, achieving 99.9% recall despite fundamental evaluation flaws. These findings underscore that proper evaluation methodology matters more than model complexity in fraud detection research. The study serves as a cautionary example of how methodological rigor must precede architectural sophistication, with implications for improving research practices across machine learning applications.
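Issues (1) and (3) come down to the order of operations, which a short sketch can make explicit. It is not the authors' protocol. It assumes a single numeric feature, rows already sorted oldest first, and plain random oversampling as the resampler; the function name and toy values are hypothetical.

```python
import random
from statistics import mean, pstdev


def leakage_free_prepare(amounts, labels, n_test, seed=0):
    """Split chronologically FIRST, then fit preprocessing on train only.

    amounts: one numeric feature per transaction, oldest first (assumed).
    n_test:  number of most recent transactions held out for evaluation.
    """
    cut = len(amounts) - n_test
    x_train, y_train = amounts[:cut], labels[:cut]
    x_test, y_test = amounts[cut:], labels[cut:]

    # Scaling statistics come from the training window only.
    mu = mean(x_train)
    sigma = pstdev(x_train) or 1.0
    x_train = [(v - mu) / sigma for v in x_train]
    x_test = [(v - mu) / sigma for v in x_test]

    # Oversample the training window only; the test set stays untouched.
    rng = random.Random(seed)
    fraud = [v for v, label in zip(x_train, y_train) if label == 1]
    n_extra = max(0, y_train.count(0) - len(fraud)) if fraud else 0
    x_train = x_train + [rng.choice(fraud) for _ in range(n_extra)]
    y_train = y_train + [1] * n_extra
    return x_train, y_train, x_test, y_test


# Toy transaction amounts with two fraud cases (illustrative only).
amounts = [10.0, 12.0, 11.0, 500.0, 9.0, 13.0, 10.5, 11.5, 480.0, 12.5]
labels = [0, 0, 0, 1, 0, 0, 0, 0, 1, 0]
x_train, y_train, x_test, y_test = leakage_free_prepare(amounts, labels, n_test=3)
```

The leaky variant performs the scaling or oversampling before the split. Test-set statistics, or copies and interpolations of test rows, then end up in the training data and inflate the reported recall.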

artificial intelligence, deep learning, machine learning, (12 more...)

doi: 10.3390/math13162563

2506.02703

Country: Europe (0.46)

Genre: Research Report > New Finding (1.00)

Industry:

Law Enforcement & Public Safety > Fraud (1.00)
Information Technology > Security & Privacy (1.00)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Performance Analysis > Accuracy (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Finding the Sweet Spot: Optimal Data Augmentation Ratio for Imbalanced Credit Scoring Using ADASYN

Chia, Luis H.

arXiv.org Artificial Intelligence | Oct-22-2025

Credit scoring models face a critical challenge: severe class imbalance, with default rates typically below 10%, which hampers model learning and predictive performance. While synthetic data augmentation techniques such as SMOTE and ADASYN have been proposed to address this issue, the optimal augmentation ratio remains unclear, with practitioners often defaulting to full balancing (1:1 ratio) without empirical justification. This study systematically evaluates 10 data augmentation scenarios using the Give Me Some Credit dataset (97,243 observations, 7% default rate), comparing SMOTE, BorderlineSMOTE, and ADASYN at different multiplication factors (1x, 2x, 3x). All models were trained using XGBoost and evaluated on a held-out test set of 29,173 real observations. Statistical significance was assessed using bootstrap testing with 1,000 iterations. Key findings reveal that ADASYN with 1x multiplication (doubling the minority class) achieved optimal performance with AUC of 0.6778 and Gini coefficient of 0.3557, representing statistically significant improvements of +0.77% and +3.00% respectively (p = 0.017, bootstrap test). Higher multiplication factors (2x and 3x) resulted in performance degradation, with 3x showing a -0.48% decrease in AUC, suggesting a "law of diminishing returns" for synthetic oversampling. The optimal class imbalance ratio was found to be 6.6:1 (majority:minority), contradicting the common practice of balancing to 1:1. This work provides the first empirical evidence of an optimal "sweet spot" for data augmentation in credit scoring, with practical guidelines for industry practitioners and researchers working with imbalanced datasets. While demonstrated on a single representative dataset, the methodology provides a reproducible framework for determining optimal augmentation ratios in other imbalanced domains.
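The reported 6.6:1 ratio can be checked from the stated 7% default rate. The sketch below follows the abstract in reading a multiplication factor of 1x as adding one synthetic copy of the minority count, which doubles it. Extending that reading to 2x and 3x is an assumption made here, and the 7% figure is itself rounded, so the results are approximate.

```python
def imbalance_ratio(minority_rate, multiplier):
    """Majority:minority ratio after adding `multiplier` times the original
    minority count as synthetic samples (multiplier=1 doubles the minority)."""
    majority = 1.0 - minority_rate
    minority = minority_rate * (1 + multiplier)
    return majority / minority


# 7% default rate as stated in the abstract; multiplier 0 means no augmentation.
ratios = {m: imbalance_ratio(0.07, m) for m in (0, 1, 2, 3)}
for m, r in ratios.items():
    print(f"{m}x: {r:.2f}:1")
```

With no augmentation the ratio is about 13.3:1, and with the 1x setting it is about 6.64:1, which matches the 6.6:1 quoted above. Full 1:1 balancing would require a multiplier of roughly 12 at this default rate, far beyond the 1x to 3x range that was evaluated.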

adasyn, artificial intelligence, machine learning, (17 more...)

2510.18252

Genre:

Research Report > New Finding (1.00)
Research Report > Experimental Study (1.00)

Industry: Banking & Finance > Credit (1.00)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Performance Analysis > Accuracy (0.94)

Vehicle Classification under Extreme Imbalance: A Comparative Study of Ensemble Learning and CNNs

Syarubany, Abu Hanif Muhammad

arXiv.org Artificial Intelligence | Sep-30-2025

We curate a 16-class corpus (~47k images) by merging Kaggle, ImageNet, and web-crawled data, and create six balanced variants via SMOTE oversampling and targeted undersampling. Lightweight ensembles, such as Random Forest, AdaBoost, and a soft-voting combiner built on MobileNet-V2 features, are benchmarked against a configurable ResNet-style CNN trained with strong augmentation and label smoothing. The best ensemble (SMOTE-combined) attains 74.8% test accuracy, while the final CNN checkpoint achieves 79.19% on the full test set and 81.25% on an unseen EE531 inference batch, confirming the advantage of deep models. Nonetheless, the most under-represented class (Barge) remains a failure mode, highlighting the limits of rebalancing alone and the persistent challenge of extreme imbalance. Results suggest prioritizing additional minority-class collection and cost-sensitive objectives (e.g., focal loss) and exploring hybrid ensemble or CNN pipelines to combine interpretability with representational power.

accuracy, artificial intelligence, machine learning, (17 more...)

2509.2488

Genre: Research Report > New Finding (0.48)

Industry: Transportation (0.48)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Performance Analysis > Accuracy (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.69)

Oversampling and Downsampling with Core-Boundary Awareness: A Data Quality-Driven Approach

Belhaouari, Samir Brahim, Kahalan, Yunis Carreon, Shaffique, Humaira, Belhaouari, Ismael, Islam, Ashhadul

arXiv.org Artificial Intelligence | Sep-25-2025

The effectiveness of machine learning models, particularly in unbalanced classification tasks, is often hindered by the failure to differentiate between critical instances near the decision boundary and redundant samples concentrated in the core of the data distribution. In this paper, we propose a method to systematically identify and differentiate between these two types of data. Through extensive experiments on multiple benchmark datasets, we show that the boundary data oversampling method improves the F1 score by up to 10% on 96% of the datasets, whereas our core-aware reduction method compresses datasets up to 90% while preserving their accuracy, making it 10 times more powerful than the original dataset. Beyond imbalanced classification, our method has broader implications for efficient model training, particularly in computationally expensive domains such as Large Language Model (LLM) training. By prioritizing high-quality, decision-relevant data, our approach can be extended to text, multimodal, and self-supervised learning scenarios, offering a pathway to faster convergence, improved generalization, and significant computational savings. This work paves the way for future research in data-efficient learning, where intelligent sampling replaces brute-force expansion, driving the next generation of AI advancements. Our code is available as a Python package at https://pypi.org/project/adaptive-resampling/ .

artificial intelligence, dataset, machine learning, (17 more...)

2509.19856

Country: Europe (1.00)

Genre: Research Report > New Finding (0.93)

Industry: Health & Medicine (1.00)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.46)

Improving Credit Card Fraud Detection through Transformer-Enhanced GAN Oversampling

Emaan, Kashaf Ul

arXiv.org Artificial Intelligence | Sep-25-2025

Credit card fraud detection is an acute financial security problem because transaction datasets are highly imbalanced, with fraud cases making up only a tiny fraction of transactions. Traditional oversampling methods such as the Synthetic Minority Oversampling Technique (SMOTE) generally create simplistic synthetic samples that do not capture complex fraud patterns. Recent advances, including Conditional Tabular Generative Adversarial Networks (CTGAN) and Tabular Variational Autoencoders (TVAE), have demonstrated improved efficiency in tabular synthesis, yet these models still struggle with high-dimensional dependence modelling. We present a hybrid approach that uses a Generative Adversarial Network (GAN) with a Transformer encoder block to produce realistic fraudulent transaction samples. The GAN architecture trains the generator adversarially, and the Transformer lets the model learn rich feature interactions through self-attention. This hybrid strategy overcomes the limitations of SMOTE, CTGAN, and TVAE by producing varied, high-quality synthetic minority-class samples. We test our algorithm on the publicly available Credit Card Fraud Detection dataset and compare it to conventional and generative resampling strategies with a variety of classifiers, such as Logistic Regression (LR), Random Forest (RF), Extreme Gradient Boosting (XGBoost), and Support Vector Machine (SVM). Findings indicate that our Transformer-based GAN shows substantial gains in Recall, F1-score and Area Under the Receiver Operating Characteristic Curve (AUC), indicating that it is effective in overcoming the severe class imbalance inherent in fraud detection.

artificial intelligence, deep learning, machine learning, (16 more...)

2509.19032

Country: North America (0.46)

Genre: Research Report > New Finding (1.00)

Industry:

Law Enforcement & Public Safety > Fraud (1.00)
Banking & Finance (1.00)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Performance Analysis > Accuracy (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
