imbalance problem
A Comprehensive Study of Supervised Machine Learning Models for Zero-Day Attack Detection: Analyzing Performance on Imbalanced Data
Among the various types of cyberattacks, identifying zero-day attacks is problematic because they are unknown to security systems as their pattern and characteristics do not match known blacklisted attacks. There are many Machine Learning (ML) models designed to analyze and detect network attacks, especially using supervised models. However, these models are designed to classify samples (normal and attacks) based on the patterns they learn during the training phase, so they perform inefficiently on unseen attacks. This research addresses this issue by evaluating five different supervised models to assess their performance and execution time in predicting zero-day attacks and find out which model performs accurately and quickly. The goal is to improve the performance of these supervised models by not only proposing a framework that applies grid search, dimensionality reduction and oversampling methods to overcome the imbalance problem, but also comparing the effectiveness of oversampling on ml model metrics, in particular the accuracy. To emulate attack detection in real life, this research applies a highly imbalanced data set and only exposes the classifiers to zero-day attacks during the testing phase, so the models are not trained to flag the zero-day attacks. Our results show that Random Forest (RF) performs best under both oversampling and non-oversampling conditions, this increased effectiveness comes at the cost of longer processing times. Therefore, we selected XG Boost (XGB) as the top model due to its fast and highly accurate performance in detecting zero-day attacks.
Augmenting The Weather: A Hybrid Counterfactual-SMOTE Algorithm for Improving Crop Growth Prediction When Climate Changes
Temraz, Mohammed, Keane, Mark T
In recent years, humanity has begun to experien ce the catastrophic effects of climate change as economic sectors (such as agriculture) struggle with unpredictable and extreme weather events. Artificial Intelligence (AI) should help us handle these climate challenges but its most promising solutions are not good at dealing with climate - disrupted data; specifically, machine learning methods that work from historical data - distributions, are not good at handling out - of - distribution, outlier events. In this paper, we propose a novel data augmentation method, that treats the predictive problems around climate change as being, in part, due to class - imbalance issues; that is, prediction from historical datasets is difficult because, by definition, they lack sufficient minority - class instances of "climate outlier events". This novel data augmentation method -- called Counterfactual - Based SMOTE (CFA - SMOTE) -- combines an instance - based counterfactual method from Explainable AI (XAI) with the well - known class - imbalance method, SMOTE. CFA - SMOTE creates synthetic dat a - points representing outlier, climate - events that augment the dataset to improve predictive performance. We report comparative experiments using this CFA - SMOTE method, comparing it to benchmark counterfactual and class - imbalance methods under different co nditions (i.e., class - imbalance ratios). The focal climate - change domain used relies on predicting grass growth on Irish dairy farms, during Europe - wide drought and forage crisis of 2018.
BalanceBenchmark: A Survey for Imbalanced Learning
Xu, Shaoxuan, Cui, Menglu, Huang, Chengxiang, Wang, Hongfa, DiHu, null
Multimodal learning has gained attention for its capacity to integrate information from different modalities. However, it is often hindered by the multimodal imbalance problem, where certain modality dominates while others remain underutilized. Although recent studies have proposed various methods to alleviate this problem, they lack comprehensive and fair comparisons. In this paper, we systematically categorize various mainstream multimodal imbalance algorithms into four groups based on the strategies they employ to mitigate imbalance. To facilitate a comprehensive evaluation of these methods, we introduce BalanceBenchmark, a benchmark including multiple widely used multidimensional datasets and evaluation metrics from three perspectives: performance, imbalance degree, and complexity. To ensure fair comparisons, we have developed a modular and extensible toolkit that standardizes the experimental workflow across different methods. Based on the experiments using BalanceBenchmark, we have identified several key insights into the characteristics and advantages of different method groups in terms of performance, balance degree and computational complexity. We expect such analysis could inspire more efficient approaches to address the imbalance problem in the future, as well as foundation models. The code of the toolkit is available at https://github.com/GeWu-Lab/BalanceBenchmark.
Architecture for Trajectory-Based Fishing Ship Classification with AIS Data
Pedroche, David Sรกnchez, Amigo, Daniel, Garcรญa, Jesรบs, Molina, Jose M.
This paper proposes a data preparation process for managing real-world kinematic data and detecting fishing vessels. The solution is a binary classification that classifies ship trajectories into either fishing or non-fishing ships. The data used are characterized by the typical problems found in classic data mining applications using real-world data, such as noise and inconsistencies. The two classes are also clearly unbalanced in the data, a problem which is addressed using algorithms that resample the instances. For classification, a series of features are extracted from spatiotemporal data that represent the trajectories of the ships, available from sequences of Automatic Identification System (AIS) reports. These features are proposed for the modelling of ship behavior but, because they do not contain context-related information, the classification can be applied in other scenarios. Experimentation shows that the proposed data preparation process is useful for the presented classification problem. In addition, positive results are obtained using minimal information.
A Similarity-Based Oversampling Method for Multi-label Imbalanced Text Data
Karaman, Ismail Hakki, Koksal, Gulser, Eriskin, Levent, Salihoglu, Salih
In real-world applications, as data availability increases, obtaining labeled data for machine learning (ML) projects remains challenging due to the high costs and intensive efforts required for data annotation. Many ML projects, particularly those focused on multi-label classification, also grapple with data imbalance issues, where certain classes may lack sufficient data to train effective classifiers. This study introduces and examines a novel oversampling method for multi-label text classification, designed to address performance challenges associated with data imbalance. The proposed method identifies potential new samples from unlabeled data by leveraging similarity measures between instances. By iteratively searching the unlabeled dataset, the method locates instances similar to those in underrepresented classes and evaluates their contribution to classifier performance enhancement. Instances that demonstrate performance improvement are then added to the labeled dataset. Experimental results indicate that the proposed approach effectively enhances classifier performance post-oversampling.
An Effective Context-Balanced Adaptation Approach for Long-Tailed Speech Recognition
Wang, Yi-Cheng, Pai, Li-Ting, Yan, Bi-Cheng, Wang, Hsin-Wei, Lin, Chi-Han, Chen, Berlin
End-to-end (E2E) automatic speech recognition (ASR) models have become standard practice for various commercial applications. However, in real-world scenarios, the long-tailed nature of word distribution often leads E2E ASR models to perform well on common words but fall short in recognizing uncommon ones. Recently, the notion of a contextual adapter (CA) was proposed to infuse external knowledge represented by a context word list into E2E ASR models. Although CA can improve recognition performance on rare words, two crucial data imbalance problems remain. First, when using low-frequency words as context words during training, since these words rarely occur in the utterance, CA becomes prone to overfit on attending to the
Distinguish Confusion in Legal Judgment Prediction via Revised Relation Knowledge
Xu, Nuo, Wang, Pinghui, Zhao, Junzhou, Sun, Feiyang, Lan, Lin, Tao, Jing, Pan, Li, Guan, Xiaohong
Legal Judgment Prediction (LJP) aims to automatically predict a law case's judgment results based on the text description of its facts. In practice, the confusing law articles (or charges) problem frequently occurs, reflecting that the law cases applicable to similar articles (or charges) tend to be misjudged. Although some recent works based on prior knowledge solve this issue well, they ignore that confusion also occurs between law articles with a high posterior semantic similarity due to the data imbalance problem instead of only between the prior highly similar ones, which is this work's further finding. This paper proposes an end-to-end model named \textit{D-LADAN} to solve the above challenges. On the one hand, D-LADAN constructs a graph among law articles based on their text definition and proposes a graph distillation operation (GDO) to distinguish the ones with a high prior semantic similarity. On the other hand, D-LADAN presents a novel momentum-updated memory mechanism to dynamically sense the posterior similarity between law articles (or charges) and a weighted GDO to adaptively capture the distinctions for revising the inductive bias caused by the data imbalance problem. We perform extensive experiments to demonstrate that D-LADAN significantly outperforms state-of-the-art methods in accuracy and robustness.
Automatic Recognition of Food Ingestion Environment from the AIM-2 Wearable Sensor
Huang, Yuning, Hassan, Mohamed Abul, He, Jiangpeng, Higgins, Janine, McCrory, Megan, Eicher-Miller, Heather, Thomas, Graham, Sazonov, Edward O, Zhu, Fengqing Maggie
Detecting an ingestion environment is an important aspect of monitoring dietary intake. It provides insightful information for dietary assessment. However, it is a challenging problem where human-based reviewing can be tedious, and algorithm-based review suffers from data imbalance and perceptual aliasing problems. To address these issues, we propose a neural network-based method with a two-stage training framework that tactfully combines fine-tuning and transfer learning techniques. Our method is evaluated on a newly collected dataset called ``UA Free Living Study", which uses an egocentric wearable camera, AIM-2 sensor, to simulate food consumption in free-living conditions. The proposed training framework is applied to common neural network backbones, combined with approaches in the general imbalanced classification field. Experimental results on the collected dataset show that our proposed method for automatic ingestion environment recognition successfully addresses the challenging data imbalance problem in the dataset and achieves a promising overall classification accuracy of 96.63%.
Balancing the Causal Effects in Class-Incremental Learning
Zheng, Junhao, Wang, Ruiyan, Zhang, Chongzhi, Feng, Huawen, Ma, Qianli
Class-Incremental Learning (CIL) is a practical and challenging problem for achieving general artificial intelligence. Recently, Pre-Trained Models (PTMs) have led to breakthroughs in both visual and natural language processing tasks. Despite recent studies showing PTMs' potential ability to learn sequentially, a plethora of work indicates the necessity of alleviating the catastrophic forgetting of PTMs. Through a pilot study and a causal analysis of CIL, we reveal that the crux lies in the imbalanced causal effects between new and old data. Specifically, the new data encourage models to adapt to new classes while hindering the adaptation of old classes. Similarly, the old data encourages models to adapt to old classes while hindering the adaptation of new classes. In other words, the adaptation process between new and old classes conflicts from the causal perspective. To alleviate this problem, we propose Balancing the Causal Effects (BaCE) in CIL. Concretely, BaCE proposes two objectives for building causal paths from both new and old data to the prediction of new and classes, respectively. In this way, the model is encouraged to adapt to all classes with causal effects from both new and old data and thus alleviates the causal imbalance problem. We conduct extensive experiments on continual image classification, continual text classification, and continual named entity recognition. Empirical results show that BaCE outperforms a series of CIL methods on different tasks and settings.
UQ at #SMM4H 2023: ALEX for Public Health Analysis with Social Media
Jiang, Yan, Qiu, Ruihong, Zhang, Yi, Huang, Zi
As social media becomes increasingly popular, more and more activities related to public health emerge. Current techniques for public health analysis involve popular models such as BERT and large language models (LLMs). However, the costs of training in-domain LLMs for public health are especially expensive. Furthermore, such kinds of in-domain datasets from social media are generally imbalanced. To tackle these challenges, the data imbalance issue can be overcome by data augmentation and balanced training. Moreover, the ability of the LLMs can be effectively utilized by prompting the model properly. In this paper, a novel ALEX framework is proposed to improve the performance of public health analysis on social media by adopting an LLMs explanation mechanism. Results show that our ALEX model got the best performance among all submissions in both Task 2 and Task 4 with a high score in Task 1 in Social Media Mining for Health 2023 (SMM4H)[1]. Our code has been released at https:// github.com/YanJiangJerry/ALEX.