AITopics

2508.00027

Country: Oceania > Australia (0.14)

Genre: Research Report (1.00)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Decision Tree Learning (0.72)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (0.68)
Information Technology > Artificial Intelligence > Machine Learning > Ensemble Learning (0.62)
Information Technology > Artificial Intelligence > Representation & Reasoning > Personal Assistant Systems (0.51)

arXiv.org Artificial Intelligence Aug-1-2025

SHAP-Guided Regularization in Machine Learning Models

Saadallah, Amal

Feature attribution methods such as SHapley Additive exPlanations (SHAP) have become instrumental in understanding machine learning models, but their role in guiding model optimization remains underexplored. In this paper, we propose a SHAP-guided regularization framework that incorporates feature importance constraints into model training to enhance both predictive performance and interpretability. Our approach applies entropy-based penalties to encourage sparse, concentrated feature attributions while promoting stability across samples. The framework is applicable to both regression and classification tasks. As a first exploration, we investigate the regularization of tree-based models using TreeSHAP. Through extensive experiments on benchmark regression and classification datasets, we demonstrate that our method improves generalization performance while ensuring robust and interpretable feature attributions. The proposed technique offers a novel, explainability-driven regularization approach, making machine learning models both more accurate and more reliable.
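
As a rough illustration of what an entropy-based attribution penalty could look like, the sketch below computes the Shannon entropy of normalized absolute attributions and adds it to a base loss. The abstract does not give the exact formulation, so the function names, the weight lam, and the hard-coded attribution vectors (stand-ins for real TreeSHAP values) are all assumptions made for illustration.

```python
import math


def attribution_entropy(attributions):
    """Shannon entropy (in nats) of the normalized absolute attributions."""
    magnitudes = [abs(a) for a in attributions]
    total = sum(magnitudes)
    if total == 0.0:
        # Assumption: with no attribution mass, treat the entropy as zero.
        return 0.0
    probs = [m / total for m in magnitudes]
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def penalized_loss(base_loss, attributions, lam=0.1):
    """Base loss plus a hypothetical entropy penalty weighted by lam."""
    return base_loss + lam * attribution_entropy(attributions)


# Hypothetical per-feature attributions for one sample (not real SHAP output).
concentrated = [0.90, -0.05, 0.03, 0.02]
spread = [0.25, -0.25, 0.25, 0.25]

print(attribution_entropy(concentrated) < attribution_entropy(spread))  # True
```

Lower entropy means the attribution mass sits on fewer features, so a penalty of this shape would favor the sparse, concentrated explanations the abstract describes.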

artificial intelligence, interpretability, machine learning, (15 more...)

2507.23665

Country:

Europe > Germany > North Rhine-Westphalia (0.14)
Asia > Middle East > Republic of Türkiye (0.14)

Genre: Research Report > New Finding (0.46)

Industry: Health & Medicine (0.70)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Ensemble Learning (0.50)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.48)

arXiv.org Artificial Intelligence Jul-29-2025

Robust Taxi Fare Prediction Under Noisy Conditions: A Comparative Study of GAT, TimesNet, and XGBoost

Moorthy, Padmavathi

Precise fare prediction is crucial in ride-hailing platforms and urban mobility systems. This study examines three machine learning models (Graph Attention Networks (GAT), XGBoost, and TimesNet) to evaluate their predictive capabilities for taxi fares using a real-world dataset comprising over 55 million records. Both raw (noisy) and denoised versions of the dataset are analyzed to assess the impact of data quality on model performance. The study evaluates the models along multiple axes, including predictive accuracy, calibration, uncertainty estimation, out-of-distribution (OOD) robustness, and feature sensitivity. We also explore pre-processing strategies, including KNN imputation, Gaussian noise injection, and autoencoder-based denoising. The study reveals critical differences between classical and deep learning models under realistic conditions, offering practical guidelines for building robust and scalable models in urban fare prediction systems. Index Terms: Taxi Fare Prediction, Machine Learning, Graph Attention Network, XGBoost, Time Series, Uncertainty Estimation, Ensemble Models, Kolmogorov-Smirnov (KS), Out-of-Distribution (OOD). A. Background and Motivation: Accurately estimating taxi fares plays a pivotal role in intelligent transportation systems and urban mobility planning.

artificial intelligence, deep learning, machine learning, (16 more...)

2507.20008

Genre: Research Report > Experimental Study (0.34)

Industry:

Transportation > Passenger (1.00)
Transportation > Ground > Road (1.00)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Ensemble Learning (1.00)

Azam, Md Basit, Singh, Sarangthem Ibotombi

Clinical-Grade Blood Pressure Prediction in ICU Settings: An Ensemble Framework with Uncertainty Quantification and Cross-Institutional Validation

arXiv.org Artificial Intelligence Jul-29-2025

Blood pressure (BP) monitoring is critical in intensive care units (ICUs), where hemodynamic instability can rapidly progress to cardiovascular collapse. Current machine learning (ML) approaches suffer from three limitations: lack of external validation, absence of uncertainty quantification, and inadequate data leakage prevention. This study presents the first comprehensive framework with novel algorithmic leakage prevention, uncertainty quantification, and cross-institutional validation for electronic health record (EHR)-based BP prediction. Our methodology implements systematic data leakage prevention, uncertainty quantification through quantile regression, and external validation between the MIMIC-III and eICU databases. An ensemble framework combines Gradient Boosting, Random Forest, and XGBoost with 74 features across five physiological domains. Internal validation achieved clinically acceptable performance (SBP: R^2 = 0.86, RMSE = 6.03 mmHg; DBP: R^2 = 0.49, RMSE = 7.13 mmHg), meeting AAMI standards. External validation showed 30% degradation, with critical limitations in hypotensive patients. Uncertainty quantification generated valid prediction intervals (80.3% SBP and 79.9% DBP coverage), enabling risk-stratified protocols with narrow intervals (< 15 mmHg) for standard monitoring and wide intervals (> 30 mmHg) for manual verification. This framework provides realistic deployment expectations for cross-institutional AI-assisted BP monitoring in critical care settings. The source code is publicly available at https://github.com/mdbasit897/clinical-bp-prediction-ehr.
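
The interval-based protocol described above can be sketched in a few lines: measure how often the true value falls inside the predicted interval, then route each prediction by interval width. Only the 15 mmHg and 30 mmHg thresholds come from the abstract; the function names, the sample numbers, and the label used for widths between the two thresholds (which the abstract does not specify) are assumptions.

```python
def interval_coverage(y_true, lower, upper):
    """Fraction of true values that fall inside their prediction interval."""
    hits = sum(1 for y, lo, hi in zip(y_true, lower, upper) if lo <= y <= hi)
    return hits / len(y_true)


def triage(lower, upper, narrow=15.0, wide=30.0):
    """Route one prediction by interval width in mmHg."""
    width = upper - lower
    if width < narrow:
        return "standard monitoring"
    if width > wide:
        return "manual verification"
    # Assumption: the abstract does not say what happens between thresholds.
    return "unspecified in the abstract"


# Made-up systolic readings and intervals, for illustration only.
y_true = [120.0, 95.0, 140.0, 80.0]
lower = [110.0, 90.0, 100.0, 85.0]
upper = [123.0, 100.0, 135.0, 120.0]

print(interval_coverage(y_true, lower, upper))     # 0.5
print([triage(lo, hi) for lo, hi in zip(lower, upper)])
```

In a quantile-regression setup, the lower and upper bounds would typically come from two models fit at symmetric quantiles (for example 0.1 and 0.9 for a nominal 80% interval), which is consistent with the roughly 80% coverage reported.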

artificial intelligence, machine learning, mmhg, (11 more...)

2507.1953

Country: North America > United States (0.69)

Genre: Research Report > Experimental Study (0.69)

Industry:

Health & Medicine > Therapeutic Area > Cardiology/Vascular Diseases (1.00)
Government > Regional Government > North America Government > United States Government (0.47)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Ensemble Learning (0.89)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (0.67)

arXiv.org Machine Learning Jul-28-2025

Learning Causally Predictable Outcomes from Psychiatric Longitudinal Data

Strobl, Eric V.

Causal inference in longitudinal biomedical data remains a central challenge, especially in psychiatry, where symptom heterogeneity and latent confounding frequently undermine classical estimators. Most existing methods for treatment effect estimation presuppose a fixed outcome variable and address confounding through observed covariate adjustment. However, the assumption of unconfoundedness may not hold for a fixed outcome in practice. To address this foundational limitation, we directly optimize the outcome definition to maximize causal identifiability. Our DEBIAS (Durable Effects with Backdoor-Invariant Aggregated Symptoms) algorithm learns non-negative, clinically interpretable weights for outcome aggregation, maximizing durable treatment effects and empirically minimizing both observed and latent confounding by leveraging the time-limited direct effects of prior treatments in psychiatric longitudinal data. The algorithm also furnishes an empirically verifiable test for outcome unconfoundedness. DEBIAS consistently outperforms state-of-the-art methods in recovering causal effects for clinically interpretable composite outcomes across comprehensive experiments in depression and schizophrenia.

artificial intelligence, correlation, machine learning, (18 more...)

2506.16629

Country:

North America > United States > Pennsylvania > Allegheny County > Pittsburgh (0.04)
Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)

Genre:

Research Report > Experimental Study (1.00)
Research Report > New Finding (0.93)
Research Report > Strength High (0.68)

Industry:

Health & Medicine > Pharmaceuticals & Biotechnology (1.00)
Health & Medicine > Therapeutic Area > Psychiatry/Psychology > Mental Health (0.93)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (0.93)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (0.68)
Information Technology > Artificial Intelligence > Machine Learning > Ensemble Learning (0.46)

Sousa, Lisa Barros de Andrade e, Miller, Gregor, Gleut, Ronan Le, Thalmeier, Dominik, Pelin, Helena, Piraud, Marie

Forest-Guided Clustering -- Shedding Light into the Random Forest Black Box

arXiv.org Artificial Intelligence Jul-28-2025

As machine learning models are increasingly deployed in sensitive application areas, the demand for interpretable and trustworthy decision-making has increased. Random Forests (RF), despite their widespread use and strong performance on tabular data, remain difficult to interpret due to their ensemble nature. We present Forest-Guided Clustering (FGC), a model-specific explainability method that reveals both local and global structure in RFs by grouping instances according to shared decision paths. FGC produces human-interpretable clusters aligned with the model's internal logic and computes cluster-specific and global feature importance scores to derive decision rules underlying RF predictions. FGC accurately recovered latent subclass structure on a benchmark dataset and outperformed classical clustering and post-hoc explanation methods. Applied to an AML transcriptomic dataset, FGC uncovered biologically coherent subpopulations, disentangled disease-relevant signals from confounders, and recovered known and novel gene expression patterns. FGC bridges the gap between performance and interpretability by providing structure-aware insights that go beyond feature-level attribution.
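
One common way to express "shared decision paths" numerically is a forest proximity: the fraction of trees in which two instances end up in the same leaf. The sketch below computes that matrix from per-tree leaf indices. It is only an assumed starting point, not the FGC algorithm itself; the leaf indices are invented, and in practice they might come from something like scikit-learn's RandomForestClassifier.apply.

```python
def proximity(leaves_a, leaves_b):
    """Fraction of trees in which two instances share a leaf."""
    same = sum(1 for a, b in zip(leaves_a, leaves_b) if a == b)
    return same / len(leaves_a)


def proximity_matrix(leaf_ids):
    """Pairwise proximities; leaf_ids[i][t] is the leaf of instance i in tree t."""
    return [[proximity(a, b) for b in leaf_ids] for a in leaf_ids]


# Hypothetical leaf indices: 3 instances (rows) in a 3-tree forest (columns).
leaf_ids = [
    [0, 3, 1],
    [0, 3, 2],
    [1, 4, 2],
]

matrix = proximity_matrix(leaf_ids)
for row in matrix:
    print([round(value, 2) for value in row])
```

A clustering step could then operate on one minus the proximity as a distance, so that instances the forest treats alike end up grouped together.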

artificial intelligence, decision tree learning, machine learning, (18 more...)

2507.19455

Country:

North America > United States (0.46)
Europe > Germany > Bavaria (0.14)

Genre: Research Report (1.00)

Industry:

Law (1.00)
Health & Medicine > Therapeutic Area > Infections and Infectious Diseases (1.00)
Health & Medicine > Therapeutic Area > Immunology (1.00)
(5 more...)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Ensemble Learning (0.85)
Information Technology > Artificial Intelligence > Machine Learning > Decision Tree Learning (0.71)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.46)
Information Technology > Artificial Intelligence > Machine Learning > Performance Analysis > Accuracy (0.46)

Pinheiro, João Manoel Herrera, de Oliveira, Suzana Vilas Boas, Silva, Thiago Henrique Segreto, Saraiva, Pedro Antonio Rabelo, de Souza, Enzo Ferreira, Godoy, Ricardo V., Ambrosio, Leonardo André, Becker, Marcelo

The Impact of Feature Scaling In Machine Learning: Effects on Regression and Classification Tasks

arXiv.org Machine Learning Jul-24-2025

This research addresses the critical lack of comprehensive studies on feature scaling by systematically evaluating 12 scaling techniques, including several less common transformations, across 14 different Machine Learning algorithms and 16 datasets for classification and regression tasks. We meticulously analyzed impacts on predictive performance (using metrics such as accuracy, MAE, MSE, and $R^2$) and computational costs (training time, inference time, and memory usage). Key findings reveal that while ensemble methods (such as Random Forest and gradient boosting models like XGBoost, CatBoost and LightGBM) demonstrate robust performance largely independent of scaling, other widely used models such as Logistic Regression, SVMs, TabNet, and MLPs show significant performance variations highly dependent on the chosen scaler. This extensive empirical analysis, with all source code, experimental results, and model parameters made publicly available to ensure complete transparency and reproducibility, offers crucial, model-specific guidance to practitioners on the need for an optimal selection of feature scaling techniques.
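
A small example helps show why some models depend on the scaler while threshold-based trees largely do not. In the sketch below, the nearest training point to a query changes once both features are min-max scaled, because the large-valued feature no longer dominates the Euclidean distance. The nearest-neighbour rule is only a stand-in for distance-sensitive models and is not taken from the abstract; the data and function names are made up.

```python
import math


def min_max_scale(rows, reference):
    """Scale each column to [0, 1] using the min and max found in reference."""
    columns = list(zip(*reference))
    lows = [min(col) for col in columns]
    spans = [(max(col) - min(col)) or 1.0 for col in columns]
    return [
        tuple((v - lo) / span for v, lo, span in zip(row, lows, spans))
        for row in rows
    ]


def nearest_index(train, query):
    """Index of the training row closest to query in Euclidean distance."""
    return min(range(len(train)), key=lambda i: math.dist(train[i], query))


# Hypothetical rows of (income, age): very different numeric ranges.
train = [(32500.0, 25.0), (36000.0, 60.0), (90000.0, 40.0)]
query = (32000.0, 59.0)

raw_nearest = nearest_index(train, query)
scaled_nearest = nearest_index(
    min_max_scale(train, train), min_max_scale([query], train)[0]
)
print(raw_nearest, scaled_nearest)  # 0 1
```

A tree split such as "income < t" selects the same rows before and after a monotonic rescaling of income, which is one intuition for the robustness of tree ensembles reported above.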

artificial intelligence, lgbm 0, machine learning, (17 more...)

2506.08274

Genre:

Research Report > New Finding (1.00)
Research Report > Experimental Study (1.00)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Performance Analysis > Accuracy (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Ensemble Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.92)
(2 more...)

Cotorobai, Alexandre, Silva, Jorge Miguel, Oliveira, Jose Luis

A Federated Random Forest Solution for Secure Distributed Machine Learning

arXiv.org Artificial Intelligence Jul-23-2025

Privacy and regulatory barriers often hinder centralized machine learning solutions, particularly in sectors like healthcare where data cannot be freely shared. Federated learning has emerged as a powerful paradigm to address these concerns; however, existing frameworks primarily support gradient-based models, leaving a gap for more interpretable, tree-based approaches. This paper introduces a federated learning framework for Random Forest classifiers that preserves data privacy and provides robust performance in distributed settings. By leveraging PySyft for secure, privacy-aware computation, our method enables multiple institutions to collaboratively train Random Forest models on locally stored data without exposing sensitive information. The framework supports weighted model averaging to account for varying data distributions, incremental learning to progressively refine models, and local evaluation to assess performance across heterogeneous datasets. Experiments on two real-world healthcare benchmarks demonstrate that the federated approach maintains competitive predictive accuracy, within a maximum 9% margin of centralized methods, while satisfying stringent privacy requirements. These findings underscore the viability of tree-based federated learning for scenarios where data cannot be centralized due to regulatory, competitive, or technical constraints. The proposed solution addresses a notable gap in existing federated learning libraries, offering an adaptable tool for secure distributed machine learning tasks that demand both transparency and reliable performance. The tool is available at https://github.com/ieeta-pt/fed_rf.
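
The abstract mentions weighted model averaging without giving details. One plausible reading, sketched below, is to combine the class probabilities predicted by each site's local forest, weighting each site by its number of training samples. This is an assumption for illustration only: it does not use PySyft, it is not taken from the fed_rf code, and the function name and numbers are hypothetical.

```python
def weighted_average_proba(site_probas, site_sizes):
    """Combine per-site class probabilities, weighting by local sample count."""
    total = sum(site_sizes)
    n_classes = len(site_probas[0])
    return [
        sum(proba[c] * size for proba, size in zip(site_probas, site_sizes)) / total
        for c in range(n_classes)
    ]


# Hypothetical predictions for one patient from two hospitals' local forests.
site_probas = [[0.8, 0.2], [0.4, 0.6]]
site_sizes = [300, 100]  # local training set sizes, used as weights

combined = weighted_average_proba(site_probas, site_sizes)
print([round(p, 3) for p in combined])  # [0.7, 0.3]
```

Only predictions and sample counts cross institutional boundaries in this sketch, which matches the stated goal of training on locally stored data without sharing the records themselves.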

artificial intelligence, decision tree learning, machine learning, (16 more...)

doi: 10.1109/CBMS65348.2025.00159

2505.08085

Country: Europe > Portugal (0.15)

Genre: Research Report > New Finding (1.00)

Industry:

Law (1.00)
Information Technology > Security & Privacy (1.00)
Health & Medicine > Therapeutic Area > Infections and Infectious Diseases (0.47)
Health & Medicine > Therapeutic Area > Immunology (0.47)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Ensemble Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Decision Tree Learning (1.00)

Chandra, Satyankar, Gupta, Ashutosh, Mallik, Kaushik, Shankaranarayanan, Krishna, Varshney, Namrita

Glitches in Decision Tree Ensemble Models

arXiv.org Machine Learning Jul-22-2025

Many critical decision-making tasks are now delegated to machine-learned models, and it is imperative that their decisions are trustworthy and reliable, and their outputs are consistent across similar inputs. We identify a new source of unreliable behaviors, called glitches, which may significantly impair the reliability of AI models having steep decision boundaries. Roughly speaking, glitches are small neighborhoods in the input space where the model's output abruptly oscillates with respect to small changes in the input. We provide a formal definition of glitches, and use well-known models and datasets from the literature to demonstrate that they have widespread existence and argue they usually indicate potential model inconsistencies in the neighborhood of where they are found. We proceed to the algorithmic search of glitches for widely used gradient-boosted decision tree (GBDT) models. We prove that the problem of detecting glitches is NP-complete for tree ensembles, already for trees of depth 4. Our glitch-search algorithm for GBDT models uses an MILP encoding of the problem, and its effectiveness and computational feasibility are demonstrated on a set of widely used GBDT benchmarks taken from the literature.
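
To make the informal description concrete, the sketch below scans a toy one-feature ensemble of decision stumps for places where the output jumps one way and then back within a short interval. This is a simplified brute-force illustration under assumed definitions: the paper's formal definition and its MILP-based search are not reproduced here, and the parameters eps and delta, the stump encoding, and the function names are all hypothetical.

```python
def predict(stumps, x):
    """Sum of stump outputs; each stump is (threshold, left_value, right_value)."""
    return sum(left if x < t else right for t, left, right in stumps)


def find_oscillations(stumps, eps, delta):
    """Intervals of width <= eps where the output jumps by >= delta and back."""
    cuts = sorted({t for t, _, _ in stumps})
    if len(cuts) < 2:
        return []
    # One probe per constant segment of the piecewise-constant model.
    probes = (
        [cuts[0] - 1.0]
        + [(a + b) / 2 for a, b in zip(cuts, cuts[1:])]
        + [cuts[-1] + 1.0]
    )
    values = [predict(stumps, p) for p in probes]
    found = []
    for i in range(len(cuts) - 1):
        first = values[i + 1] - values[i]
        second = values[i + 2] - values[i + 1]
        narrow = cuts[i + 1] - cuts[i] <= eps
        reverses = first * second < 0
        if narrow and reverses and min(abs(first), abs(second)) >= delta:
            found.append((cuts[i], cuts[i + 1]))
    return found


# Toy model: the output rises by 1 at x = 0.50 and falls back at x = 0.52.
stumps = [(0.50, 0.0, 1.0), (0.52, 0.0, -1.0), (0.90, 0.0, 0.5)]
print(find_oscillations(stumps, eps=0.05, delta=0.5))  # [(0.5, 0.52)]
```

The scan works here only because a single feature yields a sorted list of breakpoints; with many features the number of regions grows combinatorially, which is in line with the hardness result stated above.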

artificial intelligence, glitch, machine learning, (17 more...)

2507.14492

Country:

North America > United States > California > San Francisco County > San Francisco (0.14)
Asia > India > Maharashtra > Mumbai (0.04)
North America > United States > New Jersey > Hudson County > Hoboken (0.04)
(6 more...)

Genre: Research Report > New Finding (0.46)

Industry: Health & Medicine > Therapeutic Area (0.94)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (0.66)
Information Technology > Artificial Intelligence > Representation & Reasoning > Diagnosis (0.62)
Information Technology > Artificial Intelligence > Machine Learning > Decision Tree Learning (0.62)
(3 more...)

Golchian, Pegah, Kapar, Jan, Watson, David S., Wright, Marvin N.

Missing value imputation with adversarial random forests -- MissARF

arXiv.org Machine Learning Jul-22-2025

Handling missing values is a common challenge in biostatistical analyses, typically addressed by imputation methods. We propose a novel, fast, and easy-to-use imputation method called missing value imputation with adversarial random forests (MissARF), based on generative machine learning, that provides both single and multiple imputation. MissARF employs adversarial random forest (ARF) for density estimation and data synthesis. To impute a missing value of an observation, we condition on the non-missing values and sample from the estimated conditional distribution generated by ARF. Our experiments demonstrate that MissARF performs comparably to state-of-the-art single and multiple imputation methods in imputation quality, while running fast and incurring no additional cost for multiple imputation.
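
The condition-then-sample step can be illustrated with a deliberately simple stand-in: instead of an adversarial random forest, the sketch below uses the empirical distribution of complete rows that match the observed values, and draws the missing fields from it. Everything here (the function name, the fallback to all rows when nothing matches, the toy data) is an assumption for illustration and not the MissARF implementation.

```python
import random


def impute_conditional(complete_rows, incomplete, rng, n_draws=1):
    """Fill None fields by sampling from rows that match the observed fields."""
    missing = [key for key, value in incomplete.items() if value is None]
    observed = {k: v for k, v in incomplete.items() if v is not None}
    donors = [
        row for row in complete_rows
        if all(row[k] == v for k, v in observed.items())
    ]
    if not donors:
        # Assumption: fall back to the marginal distribution if nothing matches.
        donors = complete_rows
    draws = []
    for _ in range(n_draws):
        donor = rng.choice(donors)
        filled = dict(incomplete)
        for key in missing:
            filled[key] = donor[key]
        draws.append(filled)
    return draws


# Toy complete data and one record with a missing blood pressure category.
complete_rows = [
    {"smoker": "yes", "bp": "high"},
    {"smoker": "yes", "bp": "high"},
    {"smoker": "yes", "bp": "normal"},
    {"smoker": "no", "bp": "normal"},
]
record = {"smoker": "yes", "bp": None}

rng = random.Random(0)
for draw in impute_conditional(complete_rows, record, rng, n_draws=3):
    print(draw)
```

Requesting several draws yields multiple imputations from the same conditional distribution, which mirrors the abstract's point that multiple imputation adds no extra fitting cost once the density is estimated.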

artificial intelligence, imputation, machine learning, (16 more...)

2507.15681

Country:

North America > United States > Wyoming > Albany County > Laramie (0.14)
Europe > Germany > Bremen > Bremen (0.14)
Europe > United Kingdom > England > Greater London > London (0.04)
Europe > Denmark > Capital Region > Copenhagen (0.04)

Genre: Research Report > New Finding (0.67)

Industry: Health & Medicine > Public Health (0.45)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Ensemble Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Decision Tree Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.93)