AITopics

doi: 10.1016/j.scs.2025.106158

2505.09175

Country:

North America > United States (1.00)
Europe (0.92)
Asia > Middle East > Iran > Tehran Province > Tehran (0.27)

Genre:

Research Report > New Finding (1.00)
Workflow (0.92)

Industry:

Health & Medicine > Therapeutic Area (1.00)
Health & Medicine > Public Health (1.00)
Energy > Renewable (0.94)
(6 more...)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Ensemble Learning (0.86)

arXiv.org Machine Learning May-13-2025

Improving Random Forests by Smoothing

Liu, Ziyi, Luong, Phuc, Boley, Mario, Schmidt, Daniel F.

Gaussian process regression is a popular model in the small data regime due to its sound uncertainty quantification and the exploitation of the smoothness of the regression function that is encountered in a wide range of practical problems. However, Gaussian processes perform sub-optimally when the degree of smoothness is non-homogeneous across the input domain. Random forest regression partially addresses this issue by providing local basis functions of variable support set sizes that are chosen in a data-driven way. However, it does so at the expense of forgoing any degree of smoothness, which often results in poor performance in the small data regime. Here, we aim to combine the advantages of both models by applying a kernel-based smoothing mechanism to a learned random forest or any other piecewise constant prediction function. As we demonstrate empirically, the resulting model consistently improves the predictive performance of the underlying random forests and, in almost all test cases, also improves the log loss of the usual uncertainty quantification based on inter-tree variance. The latter advantage can be attributed to the ability of the smoothing model to take into account the uncertainty over the exact tree-splitting locations.
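
As a rough illustration of what kernel-based smoothing of a piecewise constant prediction function can look like, the sketch below smooths a one-dimensional step function with a Gaussian kernel in closed form. It is not the authors' method: the function names, the single fixed bandwidth, and the restriction to one input dimension are all assumptions made for this example.

```python
import math
from typing import Sequence


def normal_cdf(z: float) -> float:
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def step_predict(x: float, thresholds: Sequence[float], values: Sequence[float]) -> float:
    """Piecewise constant function: values[i] applies on the i-th interval.

    thresholds must be sorted; len(values) == len(thresholds) + 1.
    """
    i = 0
    while i < len(thresholds) and x >= thresholds[i]:
        i += 1
    return values[i]


def smoothed_predict(x: float, thresholds: Sequence[float],
                     values: Sequence[float], bandwidth: float) -> float:
    """Expected value of the step function at x + eps, eps ~ N(0, bandwidth^2).

    Each interval contributes its value times the Gaussian mass falling in it.
    The bandwidth is a hypothetical tuning parameter for this sketch.
    """
    edges = [-math.inf] + list(thresholds) + [math.inf]
    total = 0.0
    for lo, hi, value in zip(edges[:-1], edges[1:], values):
        upper = 1.0 if hi == math.inf else normal_cdf((hi - x) / bandwidth)
        lower = 0.0 if lo == -math.inf else normal_cdf((lo - x) / bandwidth)
        total += value * (upper - lower)
    return total
```

Exactly at a split location the smoothed prediction is pulled toward an average of the neighbouring leaf values, while far from any split (relative to the bandwidth) it stays close to the original step function.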

artificial intelligence, machine learning, prediction, (17 more...)

2505.06852

Country:

Oceania > Australia > Victoria > Melbourne (0.04)
Asia > Middle East > Israel > Haifa District > Haifa (0.04)
Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)

Genre: Research Report > New Finding (0.68)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Ensemble Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Decision Tree Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Regression (0.34)

Grayeli, Arya, Swarup, Vipin, Noel, Steven E.

Synthesizing Diverse Network Flow Datasets with Scalable Dynamic Multigraph Generation

arXiv.org Artificial Intelligence May-13-2025

Obtaining real-world network datasets is often challenging because of privacy, security, and computational constraints. In the absence of such datasets, graph generative models become essential tools for creating synthetic datasets. In this paper, we introduce a novel machine learning model for generating high-fidelity synthetic network flow datasets that are representative of real-world networks. Our approach involves the generation of dynamic multigraphs using a stochastic Kronecker graph generator for structure generation and a tabular generative adversarial network for feature generation. We further employ an XGBoost (eXtreme Gradient Boosting) model for graph alignment, ensuring accurate overlay of features onto the generated graph structure. We evaluate our model using new metrics that assess both the accuracy and diversity of the synthetic graphs. Our results demonstrate improvements in accuracy over previous large-scale graph generation methods while maintaining similar efficiency. We also explore the trade-off between accuracy and diversity in synthetic graph dataset creation, a topic not extensively covered in related works. Our contributions include the synthesis and evaluation of large real-world netflow datasets and the definition of new metrics for evaluating synthetic graph generative models.
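
For readers unfamiliar with stochastic Kronecker graphs, the following minimal sketch shows the basic structure-generation idea: the probability of an edge is a product of entries of a small initiator matrix, one factor per bit of the node indices. This is only a textbook-style illustration, not the paper's generator; the 2x2 initiator, the function names, and the naive quadratic sampling loop are assumptions, and nothing here covers the dynamic multigraph, feature generation, or alignment steps described in the abstract.

```python
import random
from typing import List, Tuple


def edge_probability(u: int, v: int, initiator: List[List[float]], k: int) -> float:
    """Entry (u, v) of the k-th Kronecker power of a 2x2 initiator matrix."""
    p = 1.0
    for level in range(k):
        # Pick the initiator entry selected by the level-th bit of each index.
        p *= initiator[(u >> level) & 1][(v >> level) & 1]
    return p


def sample_graph(initiator: List[List[float]], k: int, seed: int = 0) -> List[Tuple[int, int]]:
    """Sample a directed graph on 2**k nodes, one Bernoulli draw per node pair.

    This loop is O(n^2) and only suitable for small k; scalable generators
    avoid enumerating all pairs.
    """
    rng = random.Random(seed)
    n = 2 ** k
    return [
        (u, v)
        for u in range(n)
        for v in range(n)
        if rng.random() < edge_probability(u, v, initiator, k)
    ]
```

A typical call would be `sample_graph([[0.9, 0.5], [0.5, 0.1]], k=4)`, where the initiator values are arbitrary illustrative numbers.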

artificial intelligence, dataset, machine learning, (18 more...)

2505.07777

Country: North America > United States (0.68)

Genre: Research Report > New Finding (0.86)

Industry:

Information Technology > Security & Privacy (1.00)
Government (0.93)

Technology:

Information Technology > Communications > Networks (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (0.89)
Information Technology > Artificial Intelligence > Machine Learning > Ensemble Learning (0.69)

Yamarthi, Dominic Parosh, Raman, Haripriya, Parvin, Shamsad

United States Road Accident Prediction using Random Forest Predictor

arXiv.org Artificial Intelligence May-13-2025

Road accidents significantly threaten public safety and require in-depth analysis for effective prevention and mitigation strategies. This paper focuses on predicting accidents through the examination of a comprehensive traffic dataset covering 49 states in the United States. The dataset integrates information from diverse sources, including transportation departments, law enforcement, and traffic sensors. This paper specifically emphasizes predicting the number of accidents, utilizing advanced machine learning models such as regression analysis and time series analysis. The inclusion of various factors, ranging from environmental conditions to human behavior and infrastructure, ensures a holistic understanding of the dynamics influencing road safety. Temporal and spatial analysis further allows for the identification of trends, seasonal variations, and high-risk areas. The implications of this research extend to proactive decision-making for policymakers and transportation authorities. By providing accurate predictions and quantifiable insights into expected accident rates under different conditions, the paper aims to empower authorities to allocate resources efficiently and implement targeted interventions. The goal is to contribute to the development of informed policies and interventions that enhance road safety, creating a safer environment for all road users. Keywords: Machine Learning, Random Forest, Accident Prediction, AutoML, LSTM.

accident, artificial intelligence, machine learning, (13 more...)

2505.06246

Country:

Asia (1.00)
North America > United States (0.72)
Africa > Middle East > Morocco (0.14)

Genre:

Research Report > New Finding (0.48)
Research Report > Experimental Study (0.34)

Industry:

Transportation > Infrastructure & Services (1.00)
Transportation > Ground > Road (1.00)
Law Enforcement & Public Safety > Crime Prevention & Enforcement (1.00)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.89)
Information Technology > Artificial Intelligence > Machine Learning > Ensemble Learning (0.88)

arXiv.org Artificial Intelligence May-12-2025

Interactive Diabetes Risk Prediction Using Explainable Machine Learning: A Dash-Based Approach with SHAP, LIME, and Comorbidity Insights

Allani, Udaya

This study presents a web-based interactive health risk prediction tool designed to assess diabetes risk using machine learning models. Built on the 2015 CDC BRFSS dataset, the study evaluates models including Logistic Regression, Random Forest, XGBoost, LightGBM, KNN, and Neural Networks under original, SMOTE, and undersampling strategies. LightGBM with undersampling achieved the best recall, making it ideal for risk detection. The tool integrates SHAP and LIME to explain predictions and highlights comorbidity correlations using Pearson analysis. A Dash-based UI enables user-friendly interaction with model predictions, personalized suggestions, and feature insights, supporting data-driven health awareness.

artificial intelligence, explainable machine learning, interactive diabetes risk prediction, (4 more...)

2505.05683

Genre: Research Report (0.89)

Industry: Health & Medicine > Therapeutic Area > Endocrinology > Diabetes (1.00)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (0.53)
Information Technology > Artificial Intelligence > Machine Learning > Ensemble Learning (0.53)

Lopez-Ledezma, Mateo, Velarde, Gissel

Cyber Security Data Science: Machine Learning Methods and their Performance on Imbalanced Datasets

arXiv.org Artificial Intelligence May-8-2025

Cybersecurity has become essential worldwide and at all levels, concerning individuals, institutions, and governments. A basic principle in cybersecurity is to be always alert. Therefore, automation is imperative in processes where the volume of daily operations is large. Several cybersecurity applications can be addressed as binary classification problems, including anomaly detection, fraud detection, intrusion detection, spam detection, or malware detection. We present three experiments. In the first experiment, we evaluate single classifiers including Random Forests, Light Gradient Boosting Machine, eXtreme Gradient Boosting, Logistic Regression, Decision Tree, and Gradient Boosting Decision Tree. In the second experiment, we test different sampling techniques including over-sampling, under-sampling, Synthetic Minority Over-sampling Technique, and Self-Paced Ensembling. In the last experiment, we evaluate Self-Paced Ensembling and its number of base classifiers. We found that imbalance learning techniques had positive and negative effects, as reported in related studies. Thus, these techniques should be applied with caution. Besides, we found different best performers for each dataset. Therefore, we recommend testing single classifiers and imbalance learning techniques for each new dataset and application involving imbalanced datasets as is the case in several cyber security applications.
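
The two simplest sampling techniques named in the abstract, random under-sampling and random over-sampling, can be sketched in a few lines. This is a generic illustration using only the standard library, not the authors' experimental code; the function names and the choice to balance every class to the smallest or largest class size are assumptions, and Synthetic Minority Over-sampling Technique and Self-Paced Ensembling are not shown.

```python
import random
from typing import Dict, List, Sequence, Tuple


def _indices_by_class(y: Sequence[int]) -> Dict[int, List[int]]:
    """Group row indices by class label."""
    groups: Dict[int, List[int]] = {}
    for idx, label in enumerate(y):
        groups.setdefault(label, []).append(idx)
    return groups


def random_undersample(X: Sequence, y: Sequence[int], seed: int = 0) -> Tuple[List, List[int]]:
    """Randomly drop rows so every class has as many rows as the smallest class."""
    rng = random.Random(seed)
    groups = _indices_by_class(y)
    target = min(len(idxs) for idxs in groups.values())
    keep: List[int] = []
    for idxs in groups.values():
        keep.extend(rng.sample(idxs, target))
    keep.sort()  # preserve the original row order
    return [X[i] for i in keep], [y[i] for i in keep]


def random_oversample(X: Sequence, y: Sequence[int], seed: int = 0) -> Tuple[List, List[int]]:
    """Randomly duplicate rows so every class has as many rows as the largest class."""
    rng = random.Random(seed)
    groups = _indices_by_class(y)
    target = max(len(idxs) for idxs in groups.values())
    chosen: List[int] = []
    for idxs in groups.values():
        chosen.extend(idxs)  # keep every original row
        chosen.extend(rng.choices(idxs, k=target - len(idxs)))  # add duplicates
    return [X[i] for i in chosen], [y[i] for i in chosen]
```

Resampling of this kind is normally applied to the training split only, so that evaluation still reflects the original class imbalance.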

artificial intelligence, classifier, machine learning, (18 more...)

2505.04204

Country: Europe (0.28)

Genre: Research Report > New Finding (0.35)

Industry:

Information Technology > Security & Privacy (1.00)
Government > Military > Cyberwarfare (0.75)

Technology:

Information Technology > Security & Privacy (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Ensemble Learning (1.00)

Solar Flare Forecast: A Comparative Analysis of Machine Learning Algorithms for Solar Flare Class Prediction

Bringewald, Julia

Solar flares are among the most powerful and dynamic events in the solar system, resulting from the sudden release of magnetic energy stored in the Sun's atmosphere. These energetic bursts of electromagnetic radiation can release up to 10^32 erg of energy, impacting space weather and posing risks to technological infrastructure; accurate forecasting of solar flare occurrences and intensities is therefore required. This study evaluates the predictive performance of three machine learning algorithms: Random Forest, k-Nearest Neighbors (KNN), and Extreme Gradient Boosting (XGBoost) for classifying solar flares into 4 categories (B, C, M, X). Using the dataset of 13 SHARP parameters, the effectiveness of the models was evaluated in binary and multiclass classification tasks. The analysis utilized 8 principal components (PC), capturing 95% of data variance, and 100 PCs, capturing 97.5% of variance. Our approach uniquely combines binary and multiclass classification with different levels of dimensionality reduction, an innovative methodology not previously explored in the context of solar flare prediction. Employing a 10-fold stratified cross-validation and grid search for hyperparameter tuning ensured robust model evaluation. Our findings indicate that Random Forest and XGBoost consistently demonstrate strong performance across all metrics, benefiting significantly from increased dimensionality. The insights of this study enhance future research by optimizing dimensionality reduction techniques and informing model selection for astrophysical tasks. By integrating this newly acquired knowledge into future research, more accurate space weather forecasting systems can be developed, along with a deeper understanding of solar physics.

artificial intelligence, classification, machine learning, (15 more...)

2505.03385

Country: North America > United States (0.28)

Genre: Research Report > New Finding (1.00)

Industry: Energy (0.68)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Ensemble Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Performance Analysis > Accuracy (0.94)

Detecting Quishing Attacks with Machine Learning Techniques Through QR Code Analysis

Trad, Fouad, Chehab, Ali

The rise of QR code-based phishing ("Quishing") poses a growing cybersecurity threat, as attackers increasingly exploit QR codes to bypass traditional phishing defenses. Existing detection methods predominantly focus on URL analysis, which requires the extraction of the QR code payload, and may inadvertently expose users to malicious content. Moreover, QR codes can encode various types of data beyond URLs, such as Wi-Fi credentials and payment information, making URL-based detection insufficient for broader security concerns. To address these gaps, we propose the first framework for quishing detection that directly analyzes QR code structure and pixel patterns without extracting the embedded content. We generated a dataset of phishing and benign QR codes and we used it to train and evaluate multiple machine learning models, including Logistic Regression, Decision Trees, Random Forest, Naive Bayes, LightGBM, and XGBoost. Our best-performing model (XGBoost) achieves an AUC of 0.9106, demonstrating the feasibility of QR-centric detection. Through feature importance analysis, we identify key visual indicators of malicious intent and refine our feature set by removing non-informative pixels, improving performance to an AUC of 0.9133 with a reduced feature space. Our findings reveal that the structural features of QR codes correlate strongly with phishing risk. This work establishes a foundation for quishing mitigation and highlights the potential of direct QR analysis as a critical layer in modern phishing defenses.

artificial intelligence, machine learning, qr code, (16 more...)

2505.03451

Country:

Asia > Middle East > Lebanon > Beirut Governorate > Beirut (0.05)
Europe > Switzerland (0.04)
Asia > Japan > Kyūshū & Okinawa > Okinawa (0.04)

Genre: Research Report > New Finding (1.00)

Industry:

Information Technology > Security & Privacy (1.00)
Government > Military > Cyberwarfare (0.68)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Performance Analysis > Accuracy (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Ensemble Learning (0.95)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.95)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (0.79)

EnsembleCI: Ensemble Learning for Carbon Intensity Forecasting

Yan, Leyi, Wang, Linda, Liu, Sihang, Ding, Yi

Carbon intensity (CI) measures the average carbon emissions generated per unit of electricity, making it a crucial metric for quantifying and managing the environmental impact. Accurate CI predictions are vital for minimizing carbon footprints, yet the state-of-the-art method (CarbonCast) falls short due to its inability to address regional variability and lack of adaptability. To address these limitations, we introduce EnsembleCI, an adaptive, end-to-end ensemble learning-based approach for CI forecasting. EnsembleCI combines weighted predictions from multiple sublearners, offering enhanced flexibility and regional adaptability. In evaluations across 11 regional grids, EnsembleCI consistently surpasses CarbonCast, achieving the lowest mean absolute percentage error (MAPE) in almost all grids and improving prediction accuracy by an average of 19.58%. While performance still varies across grids due to inherent regional diversity, EnsembleCI reduces variability and exhibits greater robustness in long-term forecasting compared to CarbonCast and identifies region-specific key features, underscoring its interpretability and practical relevance. These findings position EnsembleCI as a more accurate and reliable solution for CI forecasting. EnsembleCI source code and data used in this paper are available at https://github.com/emmayly/EnsembleCI.
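
The abstract mentions two generic ingredients that are easy to make concrete: the mean absolute percentage error (MAPE) and a weighted combination of sublearner predictions. The sketch below shows both under loudly stated assumptions: weighting each sublearner by the inverse of its validation MAPE is a hypothetical scheme chosen for illustration, and none of the function names come from the EnsembleCI code.

```python
from typing import List, Sequence


def mape(actual: Sequence[float], predicted: Sequence[float]) -> float:
    """Mean absolute percentage error, in percent (actual values must be nonzero)."""
    total = sum(abs((a - p) / a) for a, p in zip(actual, predicted))
    return 100.0 * total / len(actual)


def inverse_error_weights(errors: Sequence[float]) -> List[float]:
    """Hypothetical weighting: proportional to 1 / error, normalised to sum to 1."""
    inverse = [1.0 / e for e in errors]
    total = sum(inverse)
    return [w / total for w in inverse]


def combine(predictions: Sequence[Sequence[float]], weights: Sequence[float]) -> List[float]:
    """Weighted average of several forecasts, one weight per sublearner."""
    horizon = len(predictions[0])
    return [
        sum(w * preds[t] for w, preds in zip(weights, predictions))
        for t in range(horizon)
    ]
```

In practice the weights would be estimated on a held-out validation window and then applied to later forecasts, so that the reported error is not computed on the same data used to choose the weights.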

artificial intelligence, data mining, machine learning, (18 more...)

doi: 10.1145/3679240.3734630

2505.01959

Country:

Europe > Netherlands > South Holland > Rotterdam (0.05)
Europe > Spain (0.04)
North America > United States > New York > New York County > New York City (0.04)
(5 more...)

Genre: Research Report (1.00)

Industry:

Energy > Power Industry (1.00)
Government > Regional Government > North America Government > United States Government (0.47)
Energy > Renewable > Solar (0.47)

Technology:

Information Technology > Data Science > Data Mining (1.00)
Information Technology > Modeling & Simulation (0.94)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.70)
Information Technology > Artificial Intelligence > Machine Learning > Ensemble Learning (0.47)

Stempfle, Lena, Matsson, Anton, Mwai, Newton, Johansson, Fredrik D.

Prediction Models That Learn to Avoid Missing Values

Handling missing values at test time is challenging for machine learning models, especially when aiming for both high accuracy and interpretability. Established approaches often add bias through imputation or excessive model complexity via missingness indicators. Moreover, either method can obscure interpretability, making it harder to understand how the model utilizes the observed variables in predictions. We propose missingness-avoiding (MA) machine learning, a general framework for training models to rarely require the values of missing (or imputed) features at test time. We create tailored MA learning algorithms for decision trees, tree ensembles, and sparse linear models by incorporating classifier-specific regularization terms in their learning objectives. The tree-based models leverage contextual missingness by reducing reliance on missing values based on the observed context. Experiments on real-world datasets demonstrate that MA-DT, MA-LASSO, MA-RF, and MA-GBT effectively reduce the reliance on features with missing values while maintaining predictive performance competitive with their unregularized counterparts. This shows that our framework gives practitioners a powerful tool to maintain interpretability in predictions with test-time missing values.

artificial intelligence, machine learning, optimization problem, (17 more...)

2505.03393

Country: Europe (0.46)

Genre: Research Report (1.00)

Industry:

Health & Medicine > Therapeutic Area > Neurology (1.00)
Health & Medicine > Diagnostic Medicine > Imaging (0.93)
Banking & Finance (0.93)
(2 more...)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.93)
Information Technology > Artificial Intelligence > Machine Learning > Decision Tree Learning (0.90)
(3 more...)