Ensemble Learning
Harnessing PU Learning for Enhanced Cloud-based DDoS Detection: A Comparative Analysis
Dilworth, Robert, Gudla, Charan
This paper explores the application of Positive-Unlabeled (PU) learning for enhanced Distributed Denial-of-Service (DDoS) detection in cloud environments. Utilizing the $\texttt{BCCC-cPacket-Cloud-DDoS-2024}$ dataset, we implement PU learning with four machine learning algorithms: XGBoost, Random Forest, Support Vector Machine, and Na\"{i}ve Bayes. Our results demonstrate the superior performance of ensemble methods, with XGBoost and Random Forest achieving $F_{1}$ scores exceeding 98%. We quantify the efficacy of each approach using metrics including $F_{1}$ score, ROC AUC, Recall, and Precision. This study bridges the gap between PU learning and cloud-based anomaly detection, providing a foundation for addressing Context-Aware DDoS Detection in multi-cloud environments. Our findings highlight the potential of PU learning in scenarios with limited labeled data, offering valuable insights for developing more robust and adaptive cloud security mechanisms.
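As a reading aid for the metrics named in this abstract, the following minimal sketch computes Precision, Recall, and the $F_{1}$ score from confusion-matrix counts. The function name and the example counts are illustrative assumptions; they are not taken from the paper or its dataset.

```python
def precision_recall_f1(tp, fp, fn):
    """Return (precision, recall, F1) from true-positive, false-positive
    and false-negative counts. Undefined ratios are reported as 0.0."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    # F1 is the harmonic mean of precision and recall.
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Hypothetical counts for an attack-vs-benign classifier (not from the paper).
p, r, f1 = precision_recall_f1(tp=90, fp=10, fn=5)
print(f"precision={p:.3f} recall={r:.3f} f1={f1:.3f}")
```

Because $F_{1}$ is a harmonic mean, it stays low whenever either Precision or Recall is low, which is why it is commonly reported alongside both.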
A Random Forest approach to detect and identify Unlawful Insider Trading
According to the Exchange Act of 1934, unlawful insider trading is the abuse of access to privileged corporate information. While a blurred line exists between "routine" and "opportunistic" insider trading, detecting the strategies that insiders devise to maneuver fair market prices to their advantage is an uphill battle for hand-engineered approaches. Working with detailed, high-dimensional financial and trade data built from multiple covariates, we provide a detailed comparison with the existing study (Deng et al. (2019)) and independently implement automated, end-to-end, state-of-the-art methods: a random forest combined with principal component analysis (PCA-RF), followed by a standalone random forest (RF), applied to 320 and 3984 randomly selected, semi-manually labeled, and normalized transactions from multiple industries. These settings successfully uncover latent structures and detect unlawful insider trading. Among the multiple scenarios, our best-performing model accurately classified 96.43 percent of transactions. Across all transactions, the models classify 95.47 percent of lawful transactions as lawful and 98.00 percent of unlawful transactions as unlawful. Besides, the model makes very few mistakes in classifying lawful as unlawful, missing only 2.00 percent. In addition to the classification task, the model generates a feature ranking based on Gini impurity, and our analysis shows that ownership- and governance-related features play important roles according to their permutation values. In summary, a simple yet powerful automated end-to-end method relieves labor-intensive activities, freeing resources to enhance rule-making and to track uncaptured unlawful insider trading transactions. We emphasize that the developed financial and trading features are capable of uncovering fraudulent behaviors.
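The Gini impurity mentioned in this abstract is the quantity a random forest's trees reduce at each split, and impurity-based feature rankings aggregate those reductions per feature. The sketch below shows the standard definition on a toy node; the helper names and the label lists are hypothetical and do not reproduce the authors' data or pipeline.

```python
from collections import Counter


def gini_impurity(labels):
    """Gini impurity of a node: 1 minus the sum of squared class shares."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def impurity_decrease(parent, left, right):
    """Impurity reduction when `parent` is split into `left` and `right`,
    with each child weighted by its share of the parent's samples."""
    n = len(parent)
    return (gini_impurity(parent)
            - len(left) / n * gini_impurity(left)
            - len(right) / n * gini_impurity(right))


# Toy node with an even class mix, split perfectly by a hypothetical feature.
parent = ["lawful"] * 5 + ["unlawful"] * 5
gain = impurity_decrease(parent, parent[:5], parent[5:])
print(gini_impurity(parent), gain)
```

A feature that often produces large decreases of this kind ranks highly; permutation importance, also mentioned in the abstract, is a separate measure based on how much performance drops when a feature's values are shuffled.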
Machine learning-driven Anomaly Detection and Forecasting for Euclid Space Telescope Operations
Gómez, Pablo, Vavrek, Roland D., Buenadicha, Guillermo, Hoar, John, Kruk, Sandor, Reerink, Jan
State-of-the-art space science missions increasingly rely on automation due to spacecraft complexity and the costs of human oversight. The high volume of data, including scientific and telemetry data, makes manual inspection challenging. Machine learning offers significant potential to meet these demands. The Euclid space telescope, in its survey phase since February 2024, exemplifies this shift. Euclid's success depends on accurate monitoring and interpretation of housekeeping telemetry and science-derived data. Thousands of telemetry parameters, monitored as time series, may or may not impact the quality of scientific data. These parameters have complex interdependencies, often due to physical relationships (e.g., proximity of temperature sensors). Optimising science operations requires careful anomaly detection and identification of hidden parameter states. Moreover, understanding the interactions between known anomalies and physical quantities is crucial yet complex, as related parameters may display anomalies with varied timing and intensity. We address these challenges by analysing temperature anomalies in Euclid's telemetry from February to August 2024, focusing on eleven temperature parameters and 35 covariates. We use a predictive XGBoost model to forecast temperatures based on historical values, detecting anomalies as deviations from predictions. A second XGBoost model predicts anomalies from covariates, capturing their relationships to temperature anomalies. We identify the top three anomalies per parameter and analyse their interactions with covariates using SHAP (Shapley Additive Explanations), enabling rapid, automated analysis of complex parameter relationships. Our method demonstrates how machine learning can enhance telemetry monitoring, offering scalable solutions for other missions with similar data challenges.
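Detecting anomalies as deviations from predictions can be sketched as follows, assuming the forecasts are already available from some model. The z-score threshold, the function name, and the toy temperature series are assumptions made for illustration; only the idea of keeping the top three anomalies per parameter comes from the abstract.

```python
from statistics import mean, pstdev


def top_residual_anomalies(actual, predicted, k=3, z_threshold=3.0):
    """Return indices of up to k samples whose forecast residual deviates
    most from the mean residual, keeping only those whose z-score exceeds
    z_threshold. Indices are ordered from most to least anomalous."""
    residuals = [a - p for a, p in zip(actual, predicted)]
    mu = mean(residuals)
    sigma = pstdev(residuals)
    if sigma == 0:
        return []  # all residuals identical, nothing stands out
    scores = [abs(r - mu) / sigma for r in residuals]
    flagged = [i for i, s in enumerate(scores) if s > z_threshold]
    return sorted(flagged, key=lambda i: scores[i], reverse=True)[:k]


# Toy series: a flat forecast and one temperature spike at index 7.
predicted = [20.0] * 20
actual = [20.0] * 20
actual[7] = 25.0
anomalies = top_residual_anomalies(actual, predicted)
print(anomalies)
```

In practice the threshold would need tuning per parameter, since a fixed z-score assumes residuals of roughly similar spread over time.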
The effect of different feature selection methods on models created with XGBoost
Neyra, Jorge, Siramshetty, Vishal B., Ashqar, Huthaifa I.
This study examines the effect that different feature selection methods have on models created with XGBoost, a popular machine learning algorithm with strong regularization methods. It shows that three different ways of reducing the dimensionality of the features produce no statistically significant change in the prediction accuracy of the model. This suggests that the traditional idea of removing noisy training data to keep models from overfitting may not apply to XGBoost, although it may still be worthwhile as a way to reduce computational complexity.
SDN Intrusion Detection Using Machine Learning Method
Mahmud, Muhammad Zawad, Alve, Shahran Rahman, Islam, Samiha, Khan, Mohammad Monirujjaman
Software-defined networking (SDN) is a new approach that makes network control directly programmable and abstracts the underlying infrastructure from applications and network services. When it comes to security, the centralization that this demands is vulnerable to a variety of cyber threats that are not typically seen in other network architectures. In this research, we developed a novel machine-learning method to detect infections in networks. We applied the method to the UNSW-NB 15 intrusion detection benchmark and trained models on this data. Random Forest and Decision Tree classifiers were assessed alongside Gradient Boosting and AdaBoost. Of these, the best-performing model was Gradient Boosting, with an accuracy, recall, and F1 score of 99.87%, 100%, and 99.85%, respectively, which makes it reliable for detecting intrusions in SDN networks. The second best-performing classifier was Random Forest with 99.38% accuracy, followed by AdaBoost and Decision Tree. The research shows that Gradient Boosting is so effective in this task because it combines weak learners into a strong ensemble model that can predict with high accuracy whether traffic is normal or malicious. This paper indicates that the GBDT-IDS model is able to improve network security significantly and performs better in terms of both real-time detection accuracy and low false positive rates. In future work, we will integrate this model into a live SDN environment to observe its application and scalability. This research serves as an initial base for further work on enhancing SDN security with ML techniques and building more secure, resilient networks.
Reconstructing MODIS Normalized Difference Snow Index Product on Greenland Ice Sheet Using Spatiotemporal Extreme Gradient Boosting Model
Ye, Fan, Cheng, Qing, Hao, Weifeng, Yu, Dayu
The spatiotemporally continuous data of normalized difference snow index (NDSI) are key to understanding the mechanisms of snow occurrence and development as well as the patterns of snow distribution changes. However, the presence of clouds, particularly prevalent in polar regions such as the Greenland Ice Sheet (GrIS), introduces a significant number of missing pixels in the MODIS NDSI daily data. To address this issue, this study proposes the utilization of a spatiotemporal extreme gradient boosting (STXGBoost) model to generate a comprehensive NDSI dataset. In the proposed model, various input variables are carefully selected, encompassing terrain features, geometry-related parameters, and surface property variables. Moreover, the model incorporates spatiotemporal variation information, enhancing its capacity for reconstructing the NDSI dataset. Verification results demonstrate the efficacy of the STXGBoost model, with a coefficient of determination of 0.962, root mean square error of 0.030, mean absolute error of 0.011, and negligible bias (0.0001). Furthermore, simulation comparisons involving missing data and cross-validation with Landsat NDSI data illustrate the model's capability to accurately reconstruct the spatial distribution of NDSI data. Notably, the proposed model surpasses the performance of traditional machine learning models, showcasing superior NDSI predictive capabilities. This study highlights the potential of leveraging auxiliary data to reconstruct NDSI in GrIS, with implications for broader applications in other regions. The findings offer valuable insights for the reconstruction of NDSI remote sensing data, contributing to the further understanding of spatiotemporal dynamics in snow-covered regions.
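The four verification metrics quoted in this abstract have standard definitions, sketched below in plain Python. The sign convention for bias (mean of predicted minus observed) is an assumption, since the abstract does not state it, and the sample values are invented for illustration rather than taken from the NDSI data.

```python
from math import sqrt


def regression_metrics(y_true, y_pred):
    """Return coefficient of determination, RMSE, MAE and bias.
    Bias is taken here as mean(predicted - observed)."""
    n = len(y_true)
    errors = [p - t for t, p in zip(y_true, y_pred)]
    bias = sum(errors) / n
    mae = sum(abs(e) for e in errors) / n
    ss_res = sum(e * e for e in errors)
    rmse = sqrt(ss_res / n)
    mean_true = sum(y_true) / n
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    # R^2 is undefined when the observations have no variance.
    r2 = 1.0 - ss_res / ss_tot if ss_tot else float("nan")
    return {"r2": r2, "rmse": rmse, "mae": mae, "bias": bias}


# Invented observed and reconstructed values on a 0-1 scale.
observed = [0.2, 0.4, 0.6, 0.8]
reconstructed = [0.25, 0.35, 0.65, 0.75]
metrics = regression_metrics(observed, reconstructed)
print(metrics)
```

Note that a near-zero bias, as in this toy case, only says that over- and under-estimates cancel on average; RMSE and MAE are needed to judge the size of individual errors.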
Unlocking Your Sales Insights: Advanced XGBoost Forecasting Models for Amazon Products
Wang, Meng, Liu, Yuchen, Li, Gangmin, Payne, Terry R., Yue, Yong, Man, Ka Lok
One of the important factors of profitability is the volume of transactions. An accurate prediction of the future transaction volume becomes a pivotal factor in shaping corporate operations and decision-making processes. E-commerce has presented manufacturers with convenient sales channels through which sales can increase dramatically. In this study, we introduce a solution that leverages the XGBoost model to tackle the challenge of predicting sales for consumer electronics products on the Amazon platform. Initially, our attempts to solely predict sales volume yielded unsatisfactory results. However, by replacing the sales volume data with sales range values, we achieved satisfactory accuracy with our model. Furthermore, our results indicate that XGBoost exhibits superior predictive performance compared to traditional models.
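Replacing raw sales volume with sales range values amounts to binning a continuous target into ordered classes, which turns the regression problem into classification. A minimal sketch, assuming hypothetical bin edges (the abstract does not give the ranges the authors used):

```python
from bisect import bisect_right

# Hypothetical lower edges of the sales ranges, in units sold.
# These are illustrative and not the ranges used in the study.
RANGE_EDGES = [10, 50, 100, 500]


def sales_range(volume, edges=RANGE_EDGES):
    """Map a sales volume to the index of its range:
    0 for volume < 10, 1 for 10-49, 2 for 50-99, 3 for 100-499, 4 for >= 500."""
    return bisect_right(edges, volume)


volumes = [3, 10, 75, 499, 500]
labels = [sales_range(v) for v in volumes]
print(labels)
```

The resulting integer labels could then serve as the target for a multi-class model; the choice of edges trades off how precise a prediction is against how easy it is to get right.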
Development and Comparative Analysis of Machine Learning Models for Hypoxemia Severity Triage in CBRNE Emergency Scenarios Using Physiological and Demographic Data from Medical-Grade Devices
Nanini, Santino, Abid, Mariem, Mamouni, Yassir, Wiedemann, Arnaud, Jouvet, Philippe, Bourassa, Stephane
This paper presents the development of machine learning (ML) models to predict hypoxemia severity during emergency triage, especially in Chemical, Biological, Radiological, Nuclear, and Explosive (CBRNE) events, using physiological data from medical-grade sensors. Gradient Boosting Models (XGBoost, LightGBM, CatBoost) and sequential models (LSTM, GRU) were trained on physiological and demographic data from the MIMIC-III and IV datasets. A robust preprocessing pipeline addressed missing data and class imbalances, and incorporated synthetic data flagged with masks. Gradient Boosting Models (GBMs) outperformed sequential models in terms of training speed, interpretability, and reliability, making them well-suited for real-time decision-making. While their performance was comparable to that of sequential models, the GBMs used score features from six physiological variables derived from the enhanced National Early Warning Score (NEWS) 2, which we termed NEWS2+. This approach significantly improved prediction accuracy. While sequential models handled temporal data well, their performance gains did not justify the higher computational cost. A 5-minute prediction window was chosen for timely intervention, with minute-level interpolations standardizing the data. Feature importance analysis highlighted the significant role of mask and score features in enhancing both transparency and performance. Temporal dependencies proved to be less critical, as Gradient Boosting Models were able to capture key patterns effectively without relying on them. This study highlights ML's potential to improve triage and reduce alarm fatigue. Future work will integrate data from multiple hospitals to enhance model generalizability across clinical settings.
A Systematic Review of Machine Learning in Sports Betting: Techniques, Challenges, and Future Directions
Galekwa, René Manassé, Tshimula, Jean Marie, Tajeuna, Etienne Gael, Kyandoghere, Kyamakya
The sports betting industry has experienced rapid growth, driven largely by technological advancements and the proliferation of online platforms. Machine learning (ML) has played a pivotal role in the transformation of this sector by enabling more accurate predictions, dynamic odds-setting, and enhanced risk management for both bookmakers and bettors. This systematic review explores various ML techniques, including support vector machines, random forests, and neural networks, as applied in different sports such as soccer, basketball, tennis, and cricket. These models utilize historical data, in-game statistics, and real-time information to optimize betting strategies and identify value bets, ultimately improving profitability. For bookmakers, ML facilitates dynamic odds adjustment and effective risk management, while bettors leverage data-driven insights to exploit market inefficiencies. This review also underscores the role of ML in fraud detection, where anomaly detection models are used to identify suspicious betting patterns. Despite these advancements, challenges such as data quality, real-time decision-making, and the inherent unpredictability of sports outcomes remain. Ethical concerns related to transparency and fairness are also of significant importance. Future research should focus on developing adaptive models that integrate multimodal data and manage risk in a manner akin to financial portfolios. This review provides a comprehensive examination of the current applications of ML in sports betting, and highlights both the potential and the limitations of these technologies.
Predicting Mortality and Functional Status Scores of Traumatic Brain Injury Patients using Supervised Machine Learning
Steinmetz, Lucas, Maheshwari, Shivam, Kazanjian, Garik, Loyson, Abigail, Alexander, Tyler, Margapuri, Venkat, Nataraj, C.
Traumatic brain injury (TBI) presents a significant public health challenge, often resulting in mortality or lasting disability. Predicting outcomes such as mortality and Functional Status Scale (FSS) scores can enhance treatment strategies and inform clinical decision-making. This study applies supervised machine learning (ML) methods to predict mortality and FSS scores using a real-world dataset of 300 pediatric TBI patients from the University of Colorado School of Medicine. The dataset captures clinical features, including demographics, injury mechanisms, and hospitalization outcomes. Eighteen ML models were evaluated for mortality prediction, and thirteen models were assessed for FSS score prediction. Performance was measured using accuracy, ROC AUC, F1-score, and mean squared error. Logistic regression and Extra Trees models achieved high precision in mortality prediction, while linear regression demonstrated the best FSS score prediction. Feature selection reduced 103 clinical variables to the most relevant, enhancing model efficiency and interpretability. This research highlights the role of ML models in identifying high-risk patients and supporting personalized interventions, demonstrating the potential of data-driven analytics to improve TBI care and integrate into clinical workflows.