smote
Classification Imbalance as Transfer Learning
Xia, Eric, Klusowski, Jason M.
Classification imbalance arises when one class is much rarer than the other. We frame this setting as transfer learning under label (prior) shift between an imbalanced source distribution induced by the observed data and a balanced target distribution under which performance is evaluated. Within this framework, we study a family of oversampling procedures that augment the training data by generating synthetic samples from an estimated minority-class distribution to roughly balance the classes, among which the celebrated SMOTE algorithm is a canonical example. We show that the excess risk decomposes into the rate achievable under balanced training (as if the data had been drawn from the balanced target distribution) and an additional term, the cost of transfer, which quantifies the discrepancy between the estimated and true minority-class distributions. In particular, we show that the cost of transfer for SMOTE dominates that of bootstrapping (random oversampling) in moderately high dimensions, suggesting that we should expect bootstrapping to have better performance than SMOTE in general. We corroborate these findings with experimental evidence. More broadly, our results provide guidance for choosing among augmentation strategies for imbalanced classification.
- North America > United States > New Jersey > Mercer County > Princeton (0.40)
- Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)
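The contrast drawn in the abstract above between bootstrapping (random oversampling) and SMOTE can be illustrated with a minimal sketch. It is not the authors' code: the function names, the toy data, and the choice of `k` are assumptions made for illustration, and the SMOTE variant shown is only the basic interpolation step between a minority point and one of its nearest minority neighbors.

```python
import random


def bootstrap_oversample(minority, n_new, rng):
    """Random oversampling: redraw existing minority points with replacement."""
    return [rng.choice(minority) for _ in range(n_new)]


def smote_like_oversample(minority, n_new, k, rng):
    """SMOTE-style step: interpolate between a minority point and one of its
    k nearest minority neighbors (requires at least two minority points)."""

    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # Rank the other minority points by distance to the base point.
        others = [p for p in minority if p is not base]
        neighbors = sorted(others, key=lambda p: sq_dist(base, p))[:k]
        neighbor = rng.choice(neighbors)
        lam = rng.random()  # interpolation weight in [0, 1)
        synthetic.append(tuple(b + lam * (n - b) for b, n in zip(base, neighbor)))
    return synthetic


# Hypothetical toy minority class: the four corners of the unit square.
rng = random.Random(0)
minority = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
boot = bootstrap_oversample(minority, 6, rng)
smote = smote_like_oversample(minority, 6, k=2, rng=rng)
```

Bootstrapped points are always exact copies of observed minority points, whereas the SMOTE-style points fall on line segments between observed points. That is the sense in which each method corresponds to a different estimate of the minority-class distribution.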
A Hybrid Deep Learning and Anomaly Detection Framework for Real-Time Malicious URL Classification
Khaled, Berkani, Rafik, Zeraoulia
The number and sophistication of cyberthreats have increased along with the internet's exponential expansion, especially those that are spread by bad URLs. A variety of assaults, such as phishing, drive-by downloads, command-and-control communications, and data exfiltration, are launched using malicious websites. Because attackers are constantly changing URLs to avoid detection, traditional blacklisting techniques are unable to keep up with the dynamic and hostile character of contemporary threats. As a result, intelligent algorithms that can recognize intricate patterns in URLs and instantly identify malicious ones have become crucial components of contemporary cybersecurity protection designs [1, 13]. Because machine learning (ML) and deep learning (DL) approaches can identify non-linear relationships in input data and generalize from observed patterns, they have shown considerable promise in the field of malicious URL detection [2, 3]. But there are still a number of obstacles to overcome: class imbalance (lack of labeled malicious data compared to benign URLs); attackers' adversarial techniques that produce highly obfuscated or anomalous URLs that undermine the effectiveness of traditional classifiers; and the majority of detection systems are restricted to monolingual user interfaces and lack real-time usability features.
Augmenting The Weather: A Hybrid Counterfactual-SMOTE Algorithm for Improving Crop Growth Prediction When Climate Changes
Temraz, Mohammed, Keane, Mark T
In recent years, humanity has begun to experience the catastrophic effects of climate change as economic sectors (such as agriculture) struggle with unpredictable and extreme weather events. Artificial Intelligence (AI) should help us handle these climate challenges but its most promising solutions are not good at dealing with climate-disrupted data; specifically, machine learning methods that work from historical data-distributions, are not good at handling out-of-distribution, outlier events. In this paper, we propose a novel data augmentation method, that treats the predictive problems around climate change as being, in part, due to class-imbalance issues; that is, prediction from historical datasets is difficult because, by definition, they lack sufficient minority-class instances of "climate outlier events". This novel data augmentation method -- called Counterfactual-Based SMOTE (CFA-SMOTE) -- combines an instance-based counterfactual method from Explainable AI (XAI) with the well-known class-imbalance method, SMOTE. CFA-SMOTE creates synthetic data-points representing outlier, climate-events that augment the dataset to improve predictive performance. We report comparative experiments using this CFA-SMOTE method, comparing it to benchmark counterfactual and class-imbalance methods under different conditions (i.e., class-imbalance ratios). The focal climate-change domain used relies on predicting grass growth on Irish dairy farms, during Europe-wide drought and forage crisis of 2018.
- North America > United States > Massachusetts > Suffolk County > Boston (0.04)
- Europe > Sweden (0.04)
- Europe > Norway (0.04)
- Research Report > New Finding (0.46)
- Research Report > Promising Solution (0.34)
- Health & Medicine (1.00)
- Government (1.00)
- Food & Agriculture > Agriculture (0.87)
Data Leakage and Deceptive Performance: A Critical Examination of Credit Card Fraud Detection Methodologies
Hayat, Khizar, Magnier, Baptiste
This study critically examines the methodological rigor in credit card fraud detection research, revealing how fundamental evaluation flaws can overshadow algorithmic sophistication. Through deliberate experimentation with improper evaluation protocols, we demonstrate that even simple models can achieve deceptively impressive results when basic methodological principles are violated. Our analysis identifies four critical issues plaguing current approaches: (1) pervasive data leakage from improper preprocessing sequences, (2) intentional vagueness in methodological reporting, (3) inadequate temporal validation for transaction data, and (4) metric manipulation through recall optimization at precision's expense. We present a case study showing how a minimal neural network architecture with data leakage outperforms many sophisticated methods reported in literature, achieving 99.9% recall despite fundamental evaluation flaws. These findings underscore that proper evaluation methodology matters more than model complexity in fraud detection research. The study serves as a cautionary example of how methodological rigor must precede architectural sophistication, with implications for improving research practices across machine learning applications.
- Europe > France > Occitanie > Hérault > Montpellier (0.04)
- Europe > Croatia > Dubrovnik-Neretva County > Dubrovnik (0.04)
- Asia > Middle East > Oman > Ad Dakhiliyah Governorate > Nizwa (0.04)
- Law Enforcement & Public Safety > Fraud (1.00)
- Information Technology > Security & Privacy (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Performance Analysis > Accuracy (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
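The first issue listed in the abstract above, data leakage from improper preprocessing sequences, can be illustrated with a small sketch of a leakage-free ordering. The abstract does not say which preprocessing steps were involved, so feature standardization is used here as an assumed example. The helper names and the synthetic transaction amounts are hypothetical.

```python
import random
import statistics


def train_test_split(rows, test_fraction, rng):
    """Shuffle a copy of the rows and split it into (train, test)."""
    shuffled = rows[:]
    rng.shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


def fit_scaler(values):
    """Estimate standardization parameters from the given values only."""
    return statistics.mean(values), statistics.pstdev(values) or 1.0


def apply_scaler(values, params):
    mean, std = params
    return [(v - mean) / std for v in values]


rng = random.Random(42)
# Hypothetical transaction amounts; not taken from any real dataset.
amounts = [rng.lognormvariate(3.0, 1.0) for _ in range(1000)]

# Step 1: split first, so the test rows are never seen by preprocessing.
train, test = train_test_split(amounts, 0.3, rng)

# Step 2: fit preprocessing on the training rows only.
params = fit_scaler(train)

# Step 3: apply the same fitted parameters to both splits.
train_scaled = apply_scaler(train, params)
test_scaled = apply_scaler(test, params)
```

The leaky ordering would call `fit_scaler(amounts)` before splitting, which lets test-set statistics influence the training features. The same ordering applies to oversampling: synthetic minority samples should be generated from the training split only, after the split has been made.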
Finding the Sweet Spot: Optimal Data Augmentation Ratio for Imbalanced Credit Scoring Using ADASYN
Credit scoring models face a critical challenge: severe class imbalance, with default rates typically below 10%, which hampers model learning and predictive performance. While synthetic data augmentation techniques such as SMOTE and ADASYN have been proposed to address this issue, the optimal augmentation ratio remains unclear, with practitioners often defaulting to full balancing (1:1 ratio) without empirical justification. This study systematically evaluates 10 data augmentation scenarios using the Give Me Some Credit dataset (97,243 observations, 7% default rate), comparing SMOTE, BorderlineSMOTE, and ADASYN at different multiplication factors (1x, 2x, 3x). All models were trained using XGBoost and evaluated on a held-out test set of 29,173 real observations. Statistical significance was assessed using bootstrap testing with 1,000 iterations. Key findings reveal that ADASYN with 1x multiplication (doubling the minority class) achieved optimal performance with AUC of 0.6778 and Gini coefficient of 0.3557, representing statistically significant improvements of +0.77% and +3.00% respectively (p = 0.017, bootstrap test). Higher multiplication factors (2x and 3x) resulted in performance degradation, with 3x showing a -0.48% decrease in AUC, suggesting a "law of diminishing returns" for synthetic oversampling. The optimal class imbalance ratio was found to be 6.6:1 (majority:minority), contradicting the common practice of balancing to 1:1. This work provides the first empirical evidence of an optimal "sweet spot" for data augmentation in credit scoring, with practical guidelines for industry practitioners and researchers working with imbalanced datasets. While demonstrated on a single representative dataset, the methodology provides a reproducible framework for determining optimal augmentation ratios in other imbalanced domains.
- Research Report > New Finding (1.00)
- Research Report > Experimental Study (1.00)
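The 6.6:1 ratio reported in the abstract above follows from the 7% default rate and the 1x multiplication factor, as the short calculation below shows. It assumes that an "Nx" factor adds N synthetic samples per real minority sample, which matches the abstract's description of 1x as doubling the minority class. The function name is hypothetical.

```python
def imbalance_ratio(minority_rate, multiplication_factor):
    """Majority:minority ratio after adding `multiplication_factor` synthetic
    samples for each real minority sample (assumed meaning of "Nx")."""
    majority = 1.0 - minority_rate
    minority = minority_rate * (1 + multiplication_factor)
    return majority / minority


# 7% default rate, as stated for the Give Me Some Credit data in the abstract.
ratios = {factor: round(imbalance_ratio(0.07, factor), 1) for factor in (0, 1, 2, 3)}
print(ratios)  # {0: 13.3, 1: 6.6, 2: 4.4, 3: 3.3}
```

Under this assumption the unaugmented data sit at roughly 13.3:1, the 1x setting gives about 6.6:1, and even the 3x setting (about 3.3:1) stays well short of the 1:1 balance that the abstract describes as common practice.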
Vehicle Classification under Extreme Imbalance: A Comparative Study of Ensemble Learning and CNNs
We curate a 16-class corpus (~47k images) by merging Kaggle, ImageNet, and web-crawled data, and create six balanced variants via SMOTE oversampling and targeted undersampling. Lightweight ensembles, such as Random Forest, AdaBoost, and a soft-voting combiner built on MobileNet-V2 features are benchmarked against a configurable ResNet-style CNN trained with strong augmentation and label smoothing. The best ensemble (SMOTE-combined) attains 74.8% test accuracy, while the CNN achieves 79.19% on the full test set and 81.25% on an unseen inference batch, confirming the advantage of deep models. Nonetheless, the most under-represented class (Barge) remains a failure mode, highlighting the limits of rebalancing alone. Results suggest prioritizing additional minority-class collection and cost-sensitive objectives (e.g., focal loss) and exploring hybrid ensemble or CNN pipelines to combine interpretability with representational power. The best ensemble (SMOTE-combined) reached 74.8% test accuracy, while the final checkpoint of CNN achieved 79.19% on the full test set and 81.25% on an unseen EE531 inference batch, confirming that deep models excel overall but still falter on the most under-represented class (Barge), underscoring the persistent challenge of extreme imbalance.
Improving Credit Card Fraud Detection through Transformer-Enhanced GAN Oversampling
Detecting credit card fraud is an acute financial security problem because transaction datasets are highly imbalanced, with fraud cases making up only a tiny fraction of all transactions. Traditional oversampling methods such as the Synthetic Minority Oversampling Technique (SMOTE) generally create simplistic synthetic samples that do not capture complex fraud patterns. Recent industry advances, including Conditional Tabular Generative Adversarial Networks (CTGAN) and Tabular Variational Autoencoders (TVAE), have improved tabular synthesis, yet these models still have difficulty modelling high-dimensional dependence. We present a hybrid approach that combines a Generative Adversarial Network (GAN) with a Transformer encoder block to produce realistic fraudulent transaction samples. The GAN architecture trains the generator adversarially, and the Transformer lets the model learn rich feature interactions through self-attention. This hybrid strategy overcomes the limitations of SMOTE, CTGAN, and TVAE by producing diverse, high-quality synthetic minority-class samples. We test our algorithm on the publicly available Credit Card Fraud Detection dataset and compare it to conventional and generative resampling strategies with a variety of classifiers, such as Logistic Regression (LR), Random Forest (RF), Extreme Gradient Boosting (XGBoost), and Support Vector Machine (SVM). Findings indicate that our Transformer-based GAN shows substantial gains in Recall, F1-score and Area Under the Receiver Operating Characteristic Curve (AUC), which indicates that it is effective in overcoming the severe class imbalance inherent in fraud detection.
- North America > United States > California (0.14)
- North America > Canada > Quebec > Montreal (0.04)
- Asia > Indonesia > Bali (0.04)
- Law Enforcement & Public Safety > Fraud (1.00)
- Banking & Finance (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Performance Analysis > Accuracy (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
Oversampling and Downsampling with Core-Boundary Awareness: A Data Quality-Driven Approach
Belhaouari, Samir Brahim, Kahalan, Yunis Carreon, Shaffique, Humaira, Belhaouari, Ismael, Islam, Ashhadul
The effectiveness of machine learning models, particularly in unbalanced classification tasks, is often hindered by the failure to differentiate between critical instances near the decision boundary and redundant samples concentrated in the core of the data distribution. In this paper, we propose a method to systematically identify and differentiate between these two types of data. Through extensive experiments on multiple benchmark datasets, we show that the boundary data oversampling method improves the F1 score by up to 10% on 96% of the datasets, whereas our core-aware reduction method compresses datasets up to 90% while preserving their accuracy, making it 10 times more powerful than the original dataset. Beyond imbalanced classification, our method has broader implications for efficient model training, particularly in computationally expensive domains such as Large Language Model (LLM) training. By prioritizing high-quality, decision-relevant data, our approach can be extended to text, multimodal, and self-supervised learning scenarios, offering a pathway to faster convergence, improved generalization, and significant computational savings. This work paves the way for future research in data-efficient learning, where intelligent sampling replaces brute-force expansion, driving the next generation of AI advancements. Our code is available as a Python package at https://pypi.org/project/adaptive-resampling/ .
- Europe > Netherlands > Limburg > Maastricht (0.04)
- North America > United States > District of Columbia > Washington (0.04)
- Europe > Sweden > Stockholm > Stockholm (0.04)
Machine Learning-Based Classification of Vessel Types in Straits Using AIS Tracks
Accurate recognition of vessel types from Automatic Identification System (AIS) tracks is essential for safety oversight and combating illegal, unreported, and unregulated (IUU) activity. This paper presents a strait-scale, machine-learning pipeline that classifies moving vessels using only AIS data. We analyze eight days of historical AIS from the Danish Maritime Authority covering the Bornholm Strait in the Baltic Sea (January 22-30, 2025). After forward/backward filling voyage records, removing kinematic and geospatial outliers, and segmenting per-MMSI tracks while excluding stationary periods (≥ 1 h), we derive 31 trajectory-level features spanning kinematics (e.g., SOG statistics), temporal, geospatial (Haversine distances, spans), and ship-shape attributes computed from AIS A/B/C/D reference points (length, width, aspect ratio, bridge-position ratio). To avoid leakage, we perform grouped train/test splits by MMSI and use stratified 5-fold cross-validation. Across five classes (cargo, tanker, passenger, high-speed craft, fishing; N=1,910 trajectories; test=382), tree-based models dominate: a Random Forest with SMOTE attains 92.15% accuracy (macro-precision 94.11%, macro-recall 92.51%, macro-F1 93.27%) on the held-out test set, while a tuned RF reaches one-vs-rest ROC-AUC up to 0.9897. Feature-importance analysis highlights the bridge-position ratio and maximum SOG as the most discriminative signals; principal errors occur between cargo and tanker, reflecting similar transit behavior. We demonstrate operational value by backfilling missing ship types on unseen data and discuss improvements such as DBSCAN-based trip segmentation and gradient-boosted ensembles to handle frequent-stop ferries and further lift performance. The results show that lightweight features over AIS trajectories enable real-time vessel type classification in straits.
- Atlantic Ocean > North Atlantic Ocean > Baltic Sea (0.25)
- North America > United States > Texas (0.04)
- Europe > Poland > West Pomerania Province > Świnoujście (0.04)
- Europe > Denmark > Capital Region > Copenhagen (0.04)
- Transportation > Marine (1.00)
- Food & Agriculture (0.93)
- Transportation > Freight & Logistics Services > Shipping (0.68)
- Government > Regional Government > North America Government > United States Government (0.46)
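Two steps named in the abstract above, Haversine distances and grouped train/test splits by MMSI, can be sketched as follows. This is an illustration under stated assumptions and not the paper's pipeline: the record layout, the `mmsi` key, the function names, and the mean Earth radius constant are all choices made here.

```python
import math
import random


def haversine_km(lat1, lon1, lat2, lon2, radius_km=6371.0088):
    """Great-circle distance between two (lat, lon) points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    # Clamp to guard against tiny floating-point overshoot above 1.
    return 2 * radius_km * math.asin(math.sqrt(min(1.0, a)))


def grouped_split(records, group_key, test_fraction, rng):
    """Split so that every record of a given group lands on one side only."""
    groups = sorted({r[group_key] for r in records})
    rng.shuffle(groups)
    n_test = max(1, round(len(groups) * test_fraction))
    test_groups = set(groups[:n_test])
    train = [r for r in records if r[group_key] not in test_groups]
    test = [r for r in records if r[group_key] in test_groups]
    return train, test


# Hypothetical trajectories: 10 vessels (MMSI values invented), 3 tracks each.
records = [{"mmsi": 100 + v, "track": t} for v in range(10) for t in range(3)]
train, test = grouped_split(records, "mmsi", 0.2, random.Random(7))
```

Because the split is made over vessel identifiers and not over individual trajectories, no vessel contributes tracks to both sides. That is the form of leakage the grouped split is meant to prevent.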
A Weak Supervision Approach for Monitoring Recreational Drug Use Effects in Social Media
Prieto-Santamaría, Lucía, Iglesias, Alba Cortés, Giné, Claudio Vidal, Calderón, Fermín Fernández, Lozano, Óscar M., Rodríguez-González, Alejandro
Understanding the real-world effects of recreational drug use remains a critical challenge in public health and biomedical research, especially as traditional surveillance systems often underrepresent user experiences. In this study, we leverage social media (specifically Twitter) as a rich and unfiltered source of user-reported effects associated with three emerging psychoactive substances: ecstasy, GHB, and 2C-B. By combining a curated list of slang terms with biomedical concept extraction via MetaMap, we identified and weakly annotated over 92,000 tweets mentioning these substances. Each tweet was labeled with a polarity reflecting whether it reported a positive or negative effect, following an expert-guided heuristic process. We then performed descriptive and comparative analyses of the reported phenotypic outcomes across substances and trained multiple machine learning classifiers to predict polarity from tweet content, accounting for strong class imbalance using techniques such as cost-sensitive learning and synthetic oversampling. The top performance on the test set was obtained from eXtreme Gradient Boosting with cost-sensitive learning (F1 = 0.885, AUPRC = 0.934). Our findings reveal that Twitter enables the detection of substance-specific phenotypic effects, and that polarity classification models can support real-time pharmacovigilance and drug effect characterization with high accuracy.
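The cost-sensitive learning mentioned in the abstract above is commonly implemented by weighting classes inversely to their frequency. The sketch below shows two such heuristics. The abstract does not state which weighting the authors used, so both formulas, the function names, and the toy labels are assumptions for illustration.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Inverse-frequency weights: n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n, k = len(labels), len(counts)
    return {cls: n / (k * count) for cls, count in counts.items()}


def positive_class_weight(labels, positive=1):
    """Ratio of negative to positive examples, a common choice of weight
    for the positive class in binary problems (e.g. as a value for
    XGBoost's scale_pos_weight parameter)."""
    n_pos = sum(1 for y in labels if y == positive)
    n_neg = len(labels) - n_pos
    return n_neg / n_pos


# Hypothetical label vector with a 90/10 imbalance.
labels = [0] * 90 + [1] * 10
weights = balanced_class_weights(labels)
pos_weight = positive_class_weight(labels)
```

With these toy labels the minority class receives a weight nine times that of the majority class under either heuristic, so an error on a minority example costs the classifier proportionally more during training.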