class imbalance problem
A Comprehensive Study of Supervised Machine Learning Models for Zero-Day Attack Detection: Analyzing Performance on Imbalanced Data
Among the various types of cyberattacks, zero-day attacks are especially hard to identify because they are unknown to security systems: their patterns and characteristics do not match known blacklisted attacks. Many Machine Learning (ML) models, especially supervised ones, have been designed to analyze and detect network attacks. However, these models classify samples (normal and attack) based on the patterns they learn during the training phase, so they perform poorly on unseen attacks. This research addresses this issue by evaluating five supervised models to assess their performance and execution time in predicting zero-day attacks and to determine which model is both accurate and fast. The goal is to improve the performance of these supervised models by not only proposing a framework that applies grid search, dimensionality reduction and oversampling methods to overcome the imbalance problem, but also comparing the effect of oversampling on ML model metrics, in particular accuracy. To emulate attack detection in real life, this research uses a highly imbalanced data set and exposes the classifiers to zero-day attacks only during the testing phase, so the models are not trained to flag the zero-day attacks. Our results show that Random Forest (RF) performs best under both oversampling and non-oversampling conditions, but this increased effectiveness comes at the cost of longer processing times. Therefore, we selected XG Boost (XGB) as the top model due to its fast and highly accurate performance in detecting zero-day attacks.
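The evaluation protocol described above, in which zero-day attacks are seen only at test time, can be sketched as a data split. This is a minimal illustration, not the authors' code: the function name `zero_day_split`, the `test_every` parameter, the label names and the toy samples are all hypothetical.

```python
def zero_day_split(samples, zero_day_labels, test_every=5):
    """Split (features, label) pairs so zero-day classes appear only in the test set.

    Assumption: known classes are split deterministically, with every
    `test_every`-th known sample going to the test set.
    """
    train, test = [], []
    seen_known = 0
    for features, label in samples:
        if label in zero_day_labels:
            # Held-out attack type: never shown to the classifier during training.
            test.append((features, label))
            continue
        seen_known += 1
        if seen_known % test_every == 0:
            test.append((features, label))
        else:
            train.append((features, label))
    return train, test


# Hypothetical, highly imbalanced toy data: 20 normal, 5 "dos", 2 "worm" samples.
samples = (
    [((0.1 * i, 1.0), "normal") for i in range(20)]
    + [((0.2 * i, 5.0), "dos") for i in range(5)]
    + [((0.3 * i, 9.0), "worm") for i in range(2)]
)
train, test = zero_day_split(samples, zero_day_labels={"worm"})
train_labels = {label for _, label in train}
test_labels = {label for _, label in test}
```

Any oversampling step would then be applied to `train` only, so that the held-out attack type stays unseen until testing.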
Spatiotemporal deep learning models for detection of rapid intensification in cyclones
Sutar, Vamshika, Singh, Amandeep, Chandra, Rohitash
Cyclone rapid intensification is the rapid increase in cyclone wind intensity, exceeding a threshold of 30 knots, within 24 hours. Rapid intensification is considered an extreme event during a cyclone, and its occurrence is relatively rare, contributing to a class imbalance in the dataset. A diverse array of factors influences the likelihood of a cyclone undergoing rapid intensification, further complicating the task for conventional machine learning models. In this paper, we evaluate deep learning, ensemble learning and data augmentation frameworks to detect cyclone rapid intensification based on wind intensity and spatial coordinates. We note that conventional data augmentation methods cannot be utilised for generating spatiotemporal patterns replicating cyclones that undergo rapid intensification. Therefore, our framework employs deep learning models to generate spatial coordinates and wind intensity that replicate cyclones to address the class imbalance problem of rapid intensification. We also use a deep learning model for the classification module within the data augmentation framework to differentiate between rapid and non-rapid intensification events during a cyclone. Our results show that data augmentation improves the results for rapid intensification detection in cyclones, and spatial coordinates play a critical role as input features to the given models. This paves the way for research in synthetic data generation for spatiotemporal data with extreme events.
Introduction
Over the past decade, the impacts of climate change have manifested in an alarming increase in the strength of tropical cyclones, characterised by elevated levels of precipitation and wind intensity, resulting in devastating consequences on a global scale [1, 2, 3]. Rappaport et al. [4] defined rapid intensification as a sudden surge in wind intensity exceeding 30 knots (35 miles/hour or 55 kilometres/hour) within 24 hours [5]. Forecasting the rapid intensification of high-category cyclones (Category 4 and 5) poses greater challenges due to their infrequent occurrence, in contrast to lower-category cyclones [6].
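The definition of rapid intensification given above can be turned into a labelling rule for a wind-intensity track. The sketch below is illustrative only: the function name, the 6-hourly sampling interval, the inclusive comparison at exactly 30 knots, and the example track are assumptions, not details taken from the paper.

```python
def rapid_intensification_flags(wind_knots, step_hours=6, window_hours=24,
                                threshold_knots=30.0):
    """Flag each time step whose wind gain over the next 24 hours reaches the threshold.

    Assumptions: observations are evenly spaced `step_hours` apart, and a gain
    equal to the threshold counts as rapid intensification.
    """
    steps = window_hours // step_hours
    flags = []
    for i, current in enumerate(wind_knots):
        # Observations that fall inside the look-ahead window.
        future = wind_knots[i + 1 : i + steps + 1]
        gain = max((w - current for w in future), default=0.0)
        flags.append(gain >= threshold_knots)
    return flags


# Hypothetical 6-hourly track of maximum sustained wind, in knots.
track = [35, 40, 50, 70, 75, 85, 78]
flags = rapid_intensification_flags(track)
```

On real track data most time steps would be labelled negative, which is the source of the class imbalance the abstract describes.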
Compressing Model with Few Class-Imbalance Samples: An Out-of-Distribution Expedition
Wu, Tian-Shuang, Lyu, Shen-Huan, Chen, Ning, Qu, Zhihao, Ye, Baoliu
In recent years, as a compromise between privacy and performance, few-sample model compression has been widely adopted to deal with limited data resulting from privacy and security concerns. However, when the number of available samples is extremely limited, class imbalance becomes a common and tricky problem. Achieving an equal number of samples across all classes is often costly and impractical in real-world applications, and previous studies on few-sample model compression have mostly ignored this significant issue. Our experiments comprehensively demonstrate that class imbalance negatively affects the overall performance of few-sample model compression methods. To address this problem, we propose a novel and adaptive framework named OOD-Enhanced Few-Sample Model Compression (OE-FSMC). This framework integrates easily accessible out-of-distribution (OOD) data into both the compression and fine-tuning processes, effectively rebalancing the training distribution. We also incorporate a joint distillation loss and a regularization term to reduce the risk of the model overfitting to the OOD data. Extensive experiments on multiple benchmark datasets show that our framework can be seamlessly incorporated into existing few-sample model compression methods, effectively mitigating the accuracy degradation caused by class imbalance.
Deep Learning Meets Oversampling: A Learning Framework to Handle Imbalanced Classification
Kishanthan, Sukumar, Hevapathige, Asela
This disproportion often leads to biased model training, making the classifier inclined towards predicting the majority class in the inference phase [1, 2]. The class imbalance problem cannot be readily overlooked, as many real-world datasets related to critical tasks, such as those used in the medical field for disease identification, the finance sector for fraud detection, and network intrusion datasets used in cyber security, exhibit such asymmetric class distributions [3, 4, 5]. Existing machine learning and deep learning approaches primarily use resampling techniques, which adjust the dataset to balance its class distribution, to tackle class imbalance [6, 7]. Among the diverse resampling techniques, oversampling approaches are commonly preferred for addressing class imbalance, mainly due to their inherent ability to equalize the class distribution while preserving data semantics and achieving superior performance. A plethora of oversampling techniques has been proposed in the literature, ranging from traditional approaches [8, 9, 10, 11, 12] to those based on deep learning [13, 14, 15].
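The majority-class bias described above is easy to see with a small worked example. The sketch below uses made-up labels (95 majority, 5 minority) and a hypothetical helper name; it shows how a classifier that always predicts the majority class can score high accuracy while missing every minority sample.

```python
def accuracy_and_minority_recall(y_true, y_pred, minority=1):
    """Return (accuracy, recall on the minority class) for two label lists."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    minority_total = sum(t == minority for t in y_true)
    minority_hit = sum(t == minority and p == minority
                       for t, p in zip(y_true, y_pred))
    accuracy = correct / len(y_true)
    recall = minority_hit / minority_total if minority_total else 0.0
    return accuracy, recall


# Hypothetical test set: 95 majority-class (0) and 5 minority-class (1) samples.
y_true = [0] * 95 + [1] * 5
# A degenerate classifier that always predicts the majority class.
y_pred = [0] * 100
acc, rec = accuracy_and_minority_recall(y_true, y_pred)
```

Here accuracy is 0.95 while minority recall is 0.0, which is why accuracy alone can be a misleading metric on imbalanced data.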
Dealing with Imbalanced Classes in Bot-IoT Dataset
Atuhurra, Jesse, Hara, Takanori, Zhang, Yuanyu, Sasabe, Masahiro, Kasahara, Shoji
With the rapidly spreading usage of Internet of Things (IoT) devices, a network intrusion detection system (NIDS) plays an important role in detecting and protecting against various types of attacks in the IoT network. To evaluate the robustness of the NIDS in the IoT network, existing work proposed a realistic botnet dataset in the IoT network (Bot-IoT dataset) and applied it to machine learning-based anomaly detection. This dataset is imbalanced because the number of normal packets is much smaller than the number of attack packets. Such imbalance may make it difficult to identify the minority class correctly. In this thesis, to address the class imbalance problem in the Bot-IoT dataset, we propose a binary classification method with the synthetic minority over-sampling technique (SMOTE). The proposed classifier aims to detect attack packets and overcome the class imbalance problem using the SMOTE algorithm. Through numerical results, we demonstrate the proposed classifier's fundamental characteristics and the impact of imbalanced data on its performance.
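The core idea of SMOTE is to create synthetic minority samples by interpolating between a minority sample and one of its nearest minority-class neighbours. The following is a simplified pure-Python sketch of that idea, not the classifier proposed in the thesis; the function name, parameters and toy points are hypothetical, and a maintained library implementation would normally be used in practice.

```python
import math
import random


def smote_sketch(minority, n_synthetic, k=3, seed=0):
    """Generate synthetic points on segments between minority samples and their neighbours."""
    if len(minority) < 2:
        raise ValueError("need at least two minority samples to interpolate")
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_synthetic):
        base_index = rng.randrange(len(minority))
        base = minority[base_index]
        # k nearest minority-class neighbours of the chosen sample (excluding itself).
        others = [p for j, p in enumerate(minority) if j != base_index]
        neighbours = sorted(others, key=lambda p: math.dist(base, p))[:k]
        neighbour = rng.choice(neighbours)
        # Random position along the segment from base to neighbour.
        gap = rng.random()
        synthetic.append(tuple(b + gap * (n - b) for b, n in zip(base, neighbour)))
    return synthetic


# Hypothetical minority-class feature vectors (e.g. normal packets in Bot-IoT).
minority_points = [(1.0, 2.0), (1.5, 1.8), (2.0, 2.5), (1.2, 2.2)]
new_points = smote_sketch(minority_points, n_synthetic=6)
```

Because each synthetic point lies on a segment between two real minority samples, the new points stay within the region the minority class already occupies.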
DA-PFL: Dynamic Affinity Aggregation for Personalized Federated Learning
Yang, Xu, Feng, Jiyuan, Guo, Songyue, Wang, Ye, Ding, Ye, Fang, Binxing, Liao, Qing
Personalized federated learning, which learns a personalized model for each client, has become a hot research topic. Existing personalized federated learning models prefer to aggregate clients with similar data distributions to improve the performance of learning models. However, similarity-based personalized federated learning methods may exacerbate the class imbalance problem. In this paper, we propose a novel Dynamic Affinity-based Personalized Federated Learning model (DA-PFL) to alleviate the class imbalance problem during federated learning. Specifically, we build an affinity metric from a complementary perspective to guide which clients should be aggregated. Then we design a dynamic aggregation strategy to dynamically aggregate clients based on the affinity metric in each round to reduce the risk of class imbalance. Extensive experiments show that the proposed DA-PFL model can significantly improve the accuracy of each client on three real-world datasets compared with state-of-the-art methods.
Random Forest Variable Importance-based Selection Algorithm in Class Imbalance Problem
Random Forest (RF) is a machine learning method that offers many advantages, including the ability to easily measure variable importance. Class balancing is a well-known solution to the class imbalance problem. However, its effect on RF variable importance has not been actively studied. In this paper, we study the effect of class balancing on RF variable importance. Our simulation results show that over-sampling is effective in correctly measuring variable importance in class-imbalanced situations with small sample sizes, while under-sampling fails to differentiate important and non-informative variables. We then propose a variable selection algorithm that utilizes RF variable importance and its confidence interval. Through an experimental study using many real and artificial datasets, we demonstrate that our proposed algorithm efficiently selects an optimal feature set, leading to improved prediction performance in class imbalance problems.
A review of ensemble learning and data augmentation models for class imbalanced problems: combination, implementation and evaluation
Khan, Azal Ahmad, Chaudhari, Omkar, Chandra, Rohitash
Class imbalance (CI) in classification problems arises when the number of observations belonging to one class is lower than the other. Ensemble learning combines multiple models to obtain a robust model and has been prominently used with data augmentation methods to address class imbalance problems. In the last decade, a number of strategies have been added to enhance ensemble learning and data augmentation methods, along with new methods such as generative adversarial networks (GANs). A combination of these has been applied in many studies, and the evaluation of different combinations would enable a better understanding and guidance for different application domains. In this paper, we present a computational study to evaluate data augmentation and ensemble learning methods used to address prominent benchmark CI problems. We present a general framework that evaluates 9 data augmentation and 9 ensemble learning methods for CI problems. Our objective is to identify the most effective combination for improving classification performance on imbalanced datasets. The results indicate that combinations of data augmentation methods with ensemble learning can significantly improve classification performance on imbalanced datasets. We find that traditional data augmentation methods such as the synthetic minority oversampling technique (SMOTE) and random oversampling (ROS) are not only better in performance for selected CI problems, but also computationally less expensive than GANs. Our study is vital for the development of novel models for handling imbalanced datasets.
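Random oversampling (ROS), one of the traditional data augmentation methods mentioned above, simply duplicates randomly chosen minority samples until every class matches the largest one. The sketch below is a minimal illustration with hypothetical names and toy data, not the implementation evaluated in the study.

```python
import random
from collections import Counter


def random_oversample(samples, labels, seed=0):
    """Duplicate randomly chosen samples of smaller classes until all classes are equal in size."""
    rng = random.Random(seed)
    counts = Counter(labels)
    target = max(counts.values())
    out_samples, out_labels = list(samples), list(labels)
    for cls, count in counts.items():
        pool = [s for s, l in zip(samples, labels) if l == cls]
        for _ in range(target - count):
            # Sampling with replacement from the original samples of this class.
            out_samples.append(rng.choice(pool))
            out_labels.append(cls)
    return out_samples, out_labels


# Hypothetical imbalanced dataset: 8 samples of class 0 and 2 of class 1.
X = [(float(i), float(i % 3)) for i in range(10)]
y = [0] * 8 + [1] * 2
X_balanced, y_balanced = random_oversample(X, y)
balanced_counts = Counter(y_balanced)
```

ROS involves no model fitting at all, which is consistent with the abstract's observation that such traditional methods are computationally cheaper than GAN-based augmentation.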
NPRL: Nightly Profile Representation Learning for Early Sepsis Onset Prediction in ICU Trauma Patients
Stewart, Tucker, Stern, Katherine, O'Keefe, Grant, Teredesai, Ankur, Hu, Juhua
Sepsis is a syndrome that develops in the body in response to the presence of an infection. Characterized by severe organ dysfunction, sepsis is one of the leading causes of mortality in Intensive Care Units (ICUs) worldwide. These complications can be reduced through early application of antibiotics. Hence, the ability to anticipate the onset of sepsis early is crucial to the survival and well-being of patients. Current machine learning algorithms deployed inside medical infrastructures have demonstrated poor performance and are insufficient for anticipating sepsis onset early. Recently, deep learning methodologies have been proposed to predict sepsis, but some fail to capture the time of onset (e.g., classifying patients' entire visits as developing sepsis or not) and others are unrealistic for deployment in clinical settings (e.g., creating training instances using a fixed time to onset, where the time of onset needs to be known a priori). In this paper, we first propose a novel but realistic prediction framework that predicts each morning whether sepsis onset will occur within the next 24 hours using the most recent data collected the previous night, when patient-provider ratios are higher due to cross-coverage, resulting in limited observation of each patient. However, as we increase the prediction rate to daily, the number of negative instances increases, while the number of positive instances remains the same. This causes a severe class imbalance problem, making it hard to capture these rare sepsis cases. To address this, we propose a nightly profile representation learning (NPRL) approach. We prove that NPRL can theoretically alleviate the rare event problem, and our empirical study using data from a level-1 trauma center demonstrates the effectiveness of our proposal.
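The claim that daily prediction increases the number of negative instances while leaving the number of positives unchanged can be illustrated with a small count. The sketch below rests on assumptions that are not in the abstract: one instance per morning of the stay, instances stop at the onset day, and the cohort of (stay length, onset day) pairs is invented for illustration.

```python
def nightly_labels(stay_days, onset_day=None):
    """One label per morning: 1 if sepsis onset falls within the next 24 hours, else 0.

    Assumption: mornings are numbered from 1, and no instances are created
    after the onset day.
    """
    last_morning = stay_days if onset_day is None else onset_day
    return [1 if day == onset_day else 0 for day in range(1, last_morning + 1)]


# Hypothetical cohort: (length of stay in days, onset day or None if no sepsis).
cohort = [(10, 7), (8, None), (12, None), (5, 4)]

# Visit-level framing: one label per patient stay.
visit_labels = [1 if onset is not None else 0 for _, onset in cohort]

# Daily framing: one label per patient per morning.
daily_labels = [label for stay, onset in cohort
                for label in nightly_labels(stay, onset)]
```

In this toy cohort the visit-level framing has 2 positives among 4 instances, while the daily framing has the same 2 positives among 31 instances.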
Projection based fuzzy least squares twin support vector machine for class imbalance problems
Tanveer, M., Mishra, Ritik, Richhariya, Bharat
Class imbalance is a major problem in many real-world classification tasks. Due to the imbalance in the number of samples, the support vector machine (SVM) classifier gets biased toward the majority class. Furthermore, these samples are often observed with a certain degree of noise. Therefore, to overcome these problems, we propose a novel fuzzy-based approach to deal with class-imbalanced as well as noisy datasets. We propose two approaches to address these problems. The first approach is based on the intuitionistic fuzzy membership and is termed robust energy-based intuitionistic fuzzy least squares twin support vector machine (IF-RELSTSVM). Furthermore, we introduce the concept of hyperplane-based fuzzy membership in our second approach, where the final classifier is termed robust energy-based fuzzy least squares twin support vector machine (F-RELSTSVM). With this technique, the membership values are obtained using a projection-based approach, where the data points are projected onto the hyperplanes. The performance of the proposed algorithms is evaluated on several benchmark and synthetic datasets. The experimental results show that the proposed IF-RELSTSVM and F-RELSTSVM models outperform the baseline algorithms. Statistical tests are performed to check the significance of the proposed algorithms. The results show the applicability of the proposed algorithms on noisy as well as imbalanced datasets.