AITopics | imbalance problem

Collaborating Authors

imbalance problem

Information about AI from the News, Publications, and Conferences

Automatic Classification – Tagging and Summarization – Customizable Filtering and Analysis

If you are looking for an answer to the question What is Artificial Intelligence? and you only have a minute, then here's the definition the Association for the Advancement of Artificial Intelligence offers on its home page: "the scientific understanding of the mechanisms underlying thought and intelligent behavior and their embodiment in machines."

However, if you are fortunate enough to have more than a minute, then please get ready to embark upon an exciting journey exploring AI (but beware, it could last a lifetime) …

A Comprehensive Study of Supervised Machine Learning Models for Zero-Day Attack Detection: Analyzing Performance on Imbalanced Data

Lotfi, Zahra, Lotfi, Mostafa

arXiv.org Artificial IntelligenceDec-9-2025

Among the various types of cyberattacks, identifying zero-day attacks is problematic because they are unknown to security systems as their pattern and characteristics do not match known blacklisted attacks. There are many Machine Learning (ML) models designed to analyze and detect network attacks, especially using supervised models. However, these models are designed to classify samples (normal and attacks) based on the patterns they learn during the training phase, so they perform inefficiently on unseen attacks. This research addresses this issue by evaluating five different supervised models to assess their performance and execution time in predicting zero-day attacks and find out which model performs accurately and quickly. The goal is to improve the performance of these supervised models by not only proposing a framework that applies grid search, dimensionality reduction and oversampling methods to overcome the imbalance problem, but also comparing the effectiveness of oversampling on ml model metrics, in particular the accuracy. To emulate attack detection in real life, this research applies a highly imbalanced data set and only exposes the classifiers to zero-day attacks during the testing phase, so the models are not trained to flag the zero-day attacks. Our results show that Random Forest (RF) performs best under both oversampling and non-oversampling conditions, this increased effectiveness comes at the cost of longer processing times. Therefore, we selected XG Boost (XGB) as the top model due to its fast and highly accurate performance in detecting zero-day attacks.

artificial intelligence, machine learning, zero-day attack, (16 more...)

arXiv.org Artificial Intelligence

2512.0703

Country: North America > Canada (0.28)

Genre: Research Report > New Finding (1.00)

Industry:

Information Technology > Security & Privacy (1.00)
Government > Military > Cyberwarfare (0.49)

Technology:

Information Technology > Security & Privacy (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Performance Analysis > Accuracy (0.68)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.68)

Add feedback

Augmenting The Weather: A Hybrid Counterfactual-SMOTE Algorithm for Improving Crop Growth Prediction When Climate Changes

Temraz, Mohammed, Keane, Mark T

arXiv.org Artificial IntelligenceNov-18-2025

In recent years, humanity has begun to experien ce the catastrophic effects of climate change as economic sectors (such as agriculture) struggle with unpredictable and extreme weather events. Artificial Intelligence (AI) should help us handle these climate challenges but its most promising solutions are not good at dealing with climate - disrupted data; specifically, machine learning methods that work from historical data - distributions, are not good at handling out - of - distribution, outlier events. In this paper, we propose a novel data augmentation method, that treats the predictive problems around climate change as being, in part, due to class - imbalance issues; that is, prediction from historical datasets is difficult because, by definition, they lack sufficient minority - class instances of "climate outlier events". This novel data augmentation method -- called Counterfactual - Based SMOTE (CFA - SMOTE) -- combines an instance - based counterfactual method from Explainable AI (XAI) with the well - known class - imbalance method, SMOTE. CFA - SMOTE creates synthetic dat a - points representing outlier, climate - events that augment the dataset to improve predictive performance. We report comparative experiments using this CFA - SMOTE method, comparing it to benchmark counterfactual and class - imbalance methods under different co nditions (i.e., class - imbalance ratios). The focal climate - change domain used relies on predicting grass growth on Irish dairy farms, during Europe - wide drought and forage crisis of 2018.

artificial intelligence, machine learning, smote, (17 more...)

arXiv.org Artificial Intelligence

2511.11945

Country:

North America > United States (0.46)
Europe > Ireland (0.28)

Genre:

Research Report > New Finding (0.46)
Research Report > Promising Solution (0.34)

Industry:

Health & Medicine (1.00)
Government (1.00)
Food & Agriculture > Agriculture (0.87)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.93)

Add feedback

BalanceBenchmark: A Survey for Imbalanced Learning

Xu, Shaoxuan, Cui, Menglu, Huang, Chengxiang, Wang, Hongfa, DiHu, null

arXiv.org Artificial IntelligenceFeb-17-2025

Multimodal learning has gained attention for its capacity to integrate information from different modalities. However, it is often hindered by the multimodal imbalance problem, where certain modality dominates while others remain underutilized. Although recent studies have proposed various methods to alleviate this problem, they lack comprehensive and fair comparisons. In this paper, we systematically categorize various mainstream multimodal imbalance algorithms into four groups based on the strategies they employ to mitigate imbalance. To facilitate a comprehensive evaluation of these methods, we introduce BalanceBenchmark, a benchmark including multiple widely used multidimensional datasets and evaluation metrics from three perspectives: performance, imbalance degree, and complexity. To ensure fair comparisons, we have developed a modular and extensible toolkit that standardizes the experimental workflow across different methods. Based on the experiments using BalanceBenchmark, we have identified several key insights into the characteristics and advantages of different method groups in terms of performance, balance degree and computational complexity. We expect such analysis could inspire more efficient approaches to address the imbalance problem in the future, as well as foundation models. The code of the toolkit is available at https://github.com/GeWu-Lab/BalanceBenchmark.

artificial intelligence, machine learning, natural language, (18 more...)

arXiv.org Artificial Intelligence

2502.10816

Country: Asia > China (0.47)

Genre: Research Report (0.84)

Technology:

Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Vision (0.69)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.46)
Information Technology > Artificial Intelligence > Representation & Reasoning > Optimization (0.46)

Add feedback

Architecture for Trajectory-Based Fishing Ship Classification with AIS Data

Pedroche, David Sánchez, Amigo, Daniel, García, Jesús, Molina, Jose M.

arXiv.org Artificial IntelligenceJan-3-2025

This paper proposes a data preparation process for managing real-world kinematic data and detecting fishing vessels. The solution is a binary classification that classifies ship trajectories into either fishing or non-fishing ships. The data used are characterized by the typical problems found in classic data mining applications using real-world data, such as noise and inconsistencies. The two classes are also clearly unbalanced in the data, a problem which is addressed using algorithms that resample the instances. For classification, a series of features are extracted from spatiotemporal data that represent the trajectories of the ships, available from sequences of Automatic Identification System (AIS) reports. These features are proposed for the modelling of ship behavior but, because they do not contain context-related information, the classification can be applied in other scenarios. Experimentation shows that the proposed data preparation process is useful for the presented classification problem. In addition, positive results are obtained using minimal information.

classification, data mining, machine learning, (19 more...)

arXiv.org Artificial Intelligence

doi: 10.3390/s20133782

2501.02038

Country:

Europe (1.00)
North America > United States (0.46)

Genre:

Workflow (1.00)
Research Report > New Finding (1.00)

Industry: Transportation > Marine (1.00)

Technology:

Information Technology > Data Science > Data Mining (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Support Vector Machines (0.46)

Add feedback

A Similarity-Based Oversampling Method for Multi-label Imbalanced Text Data

Karaman, Ismail Hakki, Koksal, Gulser, Eriskin, Levent, Salihoglu, Salih

arXiv.org Artificial IntelligenceDec-21-2024

In real-world applications, as data availability increases, obtaining labeled data for machine learning (ML) projects remains challenging due to the high costs and intensive efforts required for data annotation. Many ML projects, particularly those focused on multi-label classification, also grapple with data imbalance issues, where certain classes may lack sufficient data to train effective classifiers. This study introduces and examines a novel oversampling method for multi-label text classification, designed to address performance challenges associated with data imbalance. The proposed method identifies potential new samples from unlabeled data by leveraging similarity measures between instances. By iteratively searching the unlabeled dataset, the method locates instances similar to those in underrepresented classes and evaluates their contribution to classifier performance enhancement. Instances that demonstrate performance improvement are then added to the labeled dataset. Experimental results indicate that the proposed approach effectively enhances classifier performance post-oversampling.

classification, large language model, machine learning, (22 more...)

arXiv.org Artificial Intelligence

2411.01013

Country:

Asia > Middle East > Republic of Türkiye > Ankara Province > Ankara (0.04)
Europe > Middle East > Republic of Türkiye > Istanbul Province > Istanbul (0.04)
Asia > Middle East > Republic of Türkiye > Karaman Province > Karaman (0.04)
Asia > Middle East > Republic of Türkiye > Istanbul Province > Istanbul (0.04)

Genre:

Overview (1.00)
Research Report > New Finding (0.68)
Instructional Material > Course Syllabus & Notes (0.46)

Industry: Information Technology > Security & Privacy (0.93)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Text Processing (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (1.00)
(2 more...)

Add feedback

An Effective Context-Balanced Adaptation Approach for Long-Tailed Speech Recognition

Wang, Yi-Cheng, Pai, Li-Ting, Yan, Bi-Cheng, Wang, Hsin-Wei, Lin, Chi-Han, Chen, Berlin

arXiv.org Artificial IntelligenceSep-10-2024

End-to-end (E2E) automatic speech recognition (ASR) models have become standard practice for various commercial applications. However, in real-world scenarios, the long-tailed nature of word distribution often leads E2E ASR models to perform well on common words but fall short in recognizing uncommon ones. Recently, the notion of a contextual adapter (CA) was proposed to infuse external knowledge represented by a context word list into E2E ASR models. Although CA can improve recognition performance on rare words, two crucial data imbalance problems remain. First, when using low-frequency words as context words during training, since these words rarely occur in the utterance, CA becomes prone to overfit on attending to the token due to higher-frequency words not being present in the context list. Second, the long-tailed distribution within the context list itself still causes the model to perform poorly on low-frequency context words. In light of this, we explore in-depth the impact of altering the context list to have words with different frequency distributions on model performance, and meanwhile extend CA with a simple yet effective context-balanced learning objective. A series of experiments conducted on the AISHELL-1 benchmark dataset suggests that using all vocabulary words from the training corpus as the context list and pairing them with our balanced objective yields the best performance, demonstrating a significant reduction in character error rate (CER) by up to 1.21% and a more pronounced 9.44% reduction in the error rate of zero-shot words.

adapter, context list, context word, (16 more...)

arXiv.org Artificial Intelligence

2409.06468

Country: Asia > Taiwan > Taiwan Province > Taipei (0.04)

Genre: Research Report (1.00)

Technology:

Information Technology > Artificial Intelligence > Speech > Speech Recognition (1.00)
Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Machine Learning (1.00)

Add feedback

Distinguish Confusion in Legal Judgment Prediction via Revised Relation Knowledge

Xu, Nuo, Wang, Pinghui, Zhao, Junzhou, Sun, Feiyang, Lan, Lin, Tao, Jing, Pan, Li, Guan, Xiaohong

arXiv.org Artificial IntelligenceAug-18-2024

Legal Judgment Prediction (LJP) aims to automatically predict a law case's judgment results based on the text description of its facts. In practice, the confusing law articles (or charges) problem frequently occurs, reflecting that the law cases applicable to similar articles (or charges) tend to be misjudged. Although some recent works based on prior knowledge solve this issue well, they ignore that confusion also occurs between law articles with a high posterior semantic similarity due to the data imbalance problem instead of only between the prior highly similar ones, which is this work's further finding. This paper proposes an end-to-end model named \textit{D-LADAN} to solve the above challenges. On the one hand, D-LADAN constructs a graph among law articles based on their text definition and proposes a graph distillation operation (GDO) to distinguish the ones with a high prior semantic similarity. On the other hand, D-LADAN presents a novel momentum-updated memory mechanism to dynamically sense the posterior similarity between law articles (or charges) and a weighted GDO to adaptively capture the distinctions for revising the inductive bias caused by the data imbalance problem. We perform extensive experiments to demonstrate that D-LADAN significantly outperforms state-of-the-art methods in accuracy and robustness.

fact description, law article, representation, (13 more...)

arXiv.org Artificial Intelligence

2408.09422

Country:

Asia > China > Shaanxi Province > Xi'an (0.05)
Asia > China > Shanghai > Shanghai (0.04)
Asia > China > Zhejiang Province > Hangzhou (0.04)
Asia > China > Beijing > Beijing (0.04)

Genre:

Research Report > New Finding (0.46)
Research Report > Promising Solution (0.34)

Industry: Law > Criminal Law (1.00)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Text Processing (0.86)

Add feedback

Automatic Recognition of Food Ingestion Environment from the AIM-2 Wearable Sensor

Huang, Yuning, Hassan, Mohamed Abul, He, Jiangpeng, Higgins, Janine, McCrory, Megan, Eicher-Miller, Heather, Thomas, Graham, Sazonov, Edward O, Zhu, Fengqing Maggie

arXiv.org Artificial IntelligenceMay-13-2024

Detecting an ingestion environment is an important aspect of monitoring dietary intake. It provides insightful information for dietary assessment. However, it is a challenging problem where human-based reviewing can be tedious, and algorithm-based review suffers from data imbalance and perceptual aliasing problems. To address these issues, we propose a neural network-based method with a two-stage training framework that tactfully combines fine-tuning and transfer learning techniques. Our method is evaluated on a newly collected dataset called ``UA Free Living Study", which uses an egocentric wearable camera, AIM-2 sensor, to simulate food consumption in free-living conditions. The proposed training framework is applied to common neural network backbones, combined with approaches in the general imbalanced classification field. Experimental results on the collected dataset show that our proposed method for automatic ingestion environment recognition successfully addresses the challenging data imbalance problem in the dataset and achieves a promising overall classification accuracy of 96.63%.

classifier, dataset, recognition, (16 more...)

arXiv.org Artificial Intelligence

2405.07827

Country:

North America > United States > Iowa (0.04)
North America > United States > Alabama (0.04)

Genre: Research Report > New Finding (0.46)

Industry: Health & Medicine > Consumer Health (0.47)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.68)

Add feedback

Balancing the Causal Effects in Class-Incremental Learning

Zheng, Junhao, Wang, Ruiyan, Zhang, Chongzhi, Feng, Huawen, Ma, Qianli

arXiv.org Artificial IntelligenceFeb-15-2024

Class-Incremental Learning (CIL) is a practical and challenging problem for achieving general artificial intelligence. Recently, Pre-Trained Models (PTMs) have led to breakthroughs in both visual and natural language processing tasks. Despite recent studies showing PTMs' potential ability to learn sequentially, a plethora of work indicates the necessity of alleviating the catastrophic forgetting of PTMs. Through a pilot study and a causal analysis of CIL, we reveal that the crux lies in the imbalanced causal effects between new and old data. Specifically, the new data encourage models to adapt to new classes while hindering the adaptation of old classes. Similarly, the old data encourages models to adapt to old classes while hindering the adaptation of new classes. In other words, the adaptation process between new and old classes conflicts from the causal perspective. To alleviate this problem, we propose Balancing the Causal Effects (BaCE) in CIL. Concretely, BaCE proposes two objectives for building causal paths from both new and old data to the prediction of new and classes, respectively. In this way, the model is encouraged to adapt to all classes with causal effects from both new and old data and thus alleviates the causal imbalance problem. We conduct extensive experiments on continual image classification, continual text classification, and continual named entity recognition. Empirical results show that BaCE outperforms a series of CIL methods on different tasks and settings.

bace, causal effect, learning, (16 more...)

arXiv.org Artificial Intelligence

2402.10063

Country:

North America > Canada > Ontario > Toronto (0.04)
North America > United States > Minnesota > Hennepin County > Minneapolis (0.04)
North America > United States > Hawaii > Honolulu County > Honolulu (0.04)
(4 more...)

Genre: Research Report > New Finding (0.68)

Industry: Education (1.00)

Technology:

Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.67)

Add feedback

UQ at #SMM4H 2023: ALEX for Public Health Analysis with Social Media

Jiang, Yan, Qiu, Ruihong, Zhang, Yi, Huang, Zi

arXiv.org Artificial IntelligenceSep-12-2023

As social media becomes increasingly popular, more and more activities related to public health emerge. Current techniques for public health analysis involve popular models such as BERT and large language models (LLMs). However, the costs of training in-domain LLMs for public health are especially expensive. Furthermore, such kinds of in-domain datasets from social media are generally imbalanced. To tackle these challenges, the data imbalance issue can be overcome by data augmentation and balanced training. Moreover, the ability of the LLMs can be effectively utilized by prompting the model properly. In this paper, a novel ALEX framework is proposed to improve the performance of public health analysis on social media by adopting an LLMs explanation mechanism. Results show that our ALEX model got the best performance among all submissions in both Task 2 and Task 4 with a high score in Task 1 in Social Media Mining for Health 2023 (SMM4H)[1]. Our code has been released at https:// github.com/YanJiangJerry/ALEX.

label 1, task 2, task 4, (16 more...)

arXiv.org Artificial Intelligence

2309.04213

Country: Oceania > Australia > Queensland (0.04)

Genre: Research Report (0.70)

Industry: Health & Medicine > Public Health (1.00)

Technology:

Information Technology > Communications > Social Media (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.48)

Add feedback