AITopics

Genre: Research Report > New Finding (0.60)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Ensemble Learning (0.64)

Neural Information Processing SystemsDec-24-2025, 12:27:36 GMT

An Efficient Adversarial Attack for Tree Ensembles

We study the problem of efficient adversarial attacks on tree based ensembles such as gradient boosting decision trees (GBDTs) and random forests (RFs). Since these models are non-continuous step functions and gradient does not exist, most existing efficient adversarial attacks are not applicable. Although decision-based black-box attacks can be applied, they cannot utilize the special structure of trees. In our work, we transform the attack problem into a discrete search problem specially designed for tree ensembles, where the goal is to find a valid ``leaf tuple'' that leads to mis-classification while having the shortest distance to the original input. With this formulation, we show that a simple yet effective greedy algorithm can be applied to iteratively optimize the adversarial example by moving the leaf tuple to its neighborhood within hamming distance 1. Experimental results on several large GBDT and RF models with up to hundreds of trees demonstrate that our method can be thousands of times faster than the previous mixed-integer linear programming (MILP) based approach, while also providing smaller (better) adversarial examples than decision-based black-box attacks on general $\ell_p$ ($p=1, 2, \infty$) norm perturbations.

efficient adversarial attack, name change, tree ensemble, (8 more...)

Industry:

Information Technology > Security & Privacy (0.90)
Government > Military (0.90)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Search (0.60)
Information Technology > Artificial Intelligence > Representation & Reasoning > Mathematical & Statistical Methods (0.60)
Information Technology > Artificial Intelligence > Machine Learning > Ensemble Learning (0.60)
Information Technology > Artificial Intelligence > Machine Learning > Decision Tree Learning (0.60)

Neural Information Processing SystemsDec-24-2025, 06:06:00 GMT

Smooth And Consistent Probabilistic Regression Trees

We propose here a generalization of regression trees, referred to as Probabilistic Regression (PR) trees, that adapt to the smoothness of the prediction function relating input and output variables while preserving the interpretability of the prediction and being robust to noise. In PR trees, an observation is associated to all regions of a tree through a probability distribution that reflects how far the observation is to a region. We show that such trees are consistent, meaning that their error tends to 0 when the sample size tends to infinity, a property that has not been established for similar, previous proposals as Soft trees and Smooth Transition Regression trees. We further explain how PR trees can be used in different ensemble methods, namely Random Forests and Gradient Boosted Trees. Lastly, we assess their performance through extensive experiments that illustrate their benefits in terms of performance, interpretability and robustness to noise.

consistent probabilistic regression tree, electronic proceedings, name change, (3 more...)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Decision Tree Learning (0.88)
Information Technology > Artificial Intelligence > Machine Learning > Ensemble Learning (0.61)

Neural Information Processing SystemsDec-24-2025, 05:38:45 GMT

SnapBoost: A Heterogeneous Boosting Machine

Modern gradient boosting software frameworks, such as XGBoost and LightGBM, implement Newton descent in a functional space. At each boosting iteration, their goal is to find the base hypothesis, selected from some base hypothesis class, that is closest to the Newton descent direction in a Euclidean sense. Typically, the base hypothesis class is fixed to be all binary decision trees up to a given depth. In this work, we study a Heterogeneous Newton Boosting Machine (HNBM) in which the base hypothesis class may vary across boosting iterations. Specifically, at each boosting iteration, the base hypothesis class is chosen, from a fixed set of subclasses, by sampling from a probability distribution. We derive a global linear convergence rate for the HNBM under certain assumptions, and show that it agrees with existing rates for Newton's method when the Newton direction can be perfectly fitted by the base hypothesis at each boosting iteration. We then describe a particular realization of a HNBM, SnapBoost, that, at each boosting iteration, randomly selects between either a decision tree of variable depth or a linear regressor with random Fourier features. We describe how SnapBoost is implemented, with a focus on the training complexity. Finally, we present experimental results, using OpenML and Kaggle datasets, that show that SnapBoost is able to achieve better generalization loss than competing boosting frameworks, without taking significantly longer to tune.

base hypothesis class, iteration, snapboost, (6 more...)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Ensemble Learning (0.59)

Neural Information Processing SystemsDec-24-2025, 02:47:50 GMT

Fast, Accurate, and Simple Models for Tabular Data via Augmented Distillation

Automated machine learning (AutoML) can produce complex model ensembles by stacking, bagging, and boosting many individual models like trees, deep networks, and nearest neighbor estimators. While highly accurate, the resulting predictors are large, slow, and opaque as compared to their constituents. To improve the deployment of AutoML on tabular data, we propose FAST-DAD to distill arbitrarily-complex ensemble predictors into individual models like boosted trees, random forests, and deep networks. At the heart of our approach is a data augmentation strategy based on Gibbs sampling from a self-attention pseudolikelihood estimator. Across 30 datasets spanning regression and binary/multiclass classification tasks, FAST-DAD distillation produces significantly better individual models than one obtains through standard training on the original data. Our individual distilled models are over 10x faster and more accurate than ensemble predictors produced by AutoML tools like H2O/AutoSklearn.

artificial intelligence, machine learning, proceedings, (7 more...)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Ensemble Learning (0.61)

Golchian, Pegah, Wright, Marvin N.

Imputation Uncertainty in Interpretable Machine Learning Methods

arXiv.org Machine LearningDec-22-2025

In real data, missing values occur frequently, which affects the interpretation with interpretable machine learning (IML) methods. Recent work considers bias and shows that model explanations may differ between imputation methods, while ignoring additional imputation uncertainty and its influence on variance and confidence intervals. We therefore compare the effects of different imputation methods on the confidence interval coverage probabilities of the IML methods permutation feature importance, partial dependence plots and Shapley values. We show that single imputation leads to underestimation of variance and that, in most cases, only multiple imputation is close to nominal coverage.

imputation, imputation uncertainty, mean imputation, (14 more...)

2512.17689

Country:

Europe > Germany > Bremen > Bremen (0.14)
Asia > Middle East > Jordan (0.04)

Genre: Research Report (0.82)

Industry: Health & Medicine (0.68)

Technology:

Information Technology > Data Science (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Ensemble Learning (0.50)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (0.49)

arXiv.org Machine LearningDec-12-2025

Maximum Risk Minimization with Random Forests

Freni, Francesco, Fries, Anya, Kühne, Linus, Reichstein, Markus, Peters, Jonas

We consider a regression setting where observations are collected in different environments modeled by different data distributions. The field of out-of-distribution (OOD) generalization aims to design methods that generalize better to test environments whose distributions differ from those observed during training. One line of such works has proposed to minimize the maximum risk across environments, a principle that we refer to as MaxRM (Maximum Risk Minimization). In this work, we introduce variants of random forests based on the principle of MaxRM. We provide computationally efficient algorithms and prove statistical consistency for our primary method. Our proposed method can be used with each of the following three risks: the mean squared error, the negative reward (which relates to the explained variance), and the regret (which quantifies the excess risk relative to the best predictor). For MaxRM with regret as the risk, we prove a novel out-of-sample guarantee over unseen test distributions. Finally, we evaluate the proposed methods on both simulated and real-world data.

maximum mse, maxrm-rf, mse, (14 more...)

2512.10445

Country:

Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)
North America > United States > California > San Francisco County > San Francisco (0.04)
Europe > Switzerland (0.04)
(4 more...)

Genre: Research Report > Experimental Study (0.46)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Decision Tree Learning (0.87)
Information Technology > Artificial Intelligence > Machine Learning > Ensemble Learning (0.73)

arXiv.org Artificial IntelligenceDec-11-2025

Predicting the Containment Time of California Wildfires Using Machine Learning

Bhardwaj, Shashank

California's wildfire season keeps getting worse over the years, overwhelming the emergency response teams. These fires cause massive destruction to both property and human life. Because of these reasons, there's a growing need for accurate and practical predictions that can help assist with resources allocation for the Wildfire managers or the response teams. In this research, we built machine learning models to predict the number of days it will require to fully contain a wildfire in California. Here, we addressed an important gap in the current literature. Most prior research has concentrated on wildfire risk or how fires spread, and the few that examine the duration typically predict it in broader categories rather than a continuous measure. This research treats the wildfire duration prediction as a regression task, which allows for more detailed and precise forecasts rather than just the broader categorical predictions used in prior work. We built the models by combining three publicly available datasets from California Department of Forestry and Fire Protection's Fire and Resource Assessment Program (FRAP). This study compared the performance of baseline ensemble regressor, Random Forest and XGBoost, with a Long Short-Term Memory (LSTM) neural network. The results show that the XGBoost model slightly outperforms the Random Forest model, likely due to its superior handling of static features in the dataset. The LSTM model, on the other hand, performed worse than the ensemble models because the dataset lacked temporal features. Overall, this study shows that, depending on the feature availability, Wildfire managers or Fire management authorities can select the most appropriate model to accurately predict wildfire containment duration and allocate resources effectively.

artificial intelligence, deep learning, machine learning, (17 more...)

arXiv.org Artificial Intelligence

2512.09835

Country: North America > United States > California (1.00)

Genre: Research Report > New Finding (0.66)

Industry:

Law Enforcement & Public Safety > Fire & Emergency Services (1.00)
Government > Regional Government > North America Government > United States Government (1.00)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Ensemble Learning (1.00)

Zou, Yating, Matabuena, Marcos, Kosorok, Michael R.

Distributional Random Forests for Complex Survey Designs on Reproducing Kernel Hilbert Spaces

arXiv.org Machine LearningDec-10-2025

We study estimation of the conditional law $P(Y|X=\mathbf{x})$ and continuous functionals $Ψ(P(Y|X=\mathbf{x}))$ when $Y$ takes values in a locally compact Polish space, $X \in \mathbb{R}^p$, and the observations arise from a complex survey design. We propose a survey-calibrated distributional random forest (SDRF) that incorporates complex-design features via a pseudo-population bootstrap, PSU-level honesty, and a Maximum Mean Discrepancy (MMD) split criterion computed from kernel mean embeddings of Hájek-type (design-weighted) node distributions. We provide a framework for analyzing forest-style estimators under survey designs; establish design consistency for the finite-population target and model consistency for the super-population target under explicit conditions on the design, kernel, resampling multipliers, and tree partitions. As far as we are aware, these are the first results on model-free estimation of conditional distributions under survey designs. Simulations under a stratified two-stage cluster design provide finite sample performance and demonstrate the statistical error price of ignoring the survey design. The broad applicability of SDRF is demonstrated using NHANES: We estimate the tolerance regions of the conditional joint distribution of two diabetes biomarkers, illustrating how distributional heterogeneity can support subgroup-specific risk profiling for diabetes mellitus in the U.S. population.

conditional distribution, finite population, random forest, (17 more...)

2512.08179

Country:

North America > United States (0.48)
Europe > Austria > Vienna (0.14)
Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)
(2 more...)

Genre: Research Report > Experimental Study (0.93)

Industry: Health & Medicine > Therapeutic Area > Endocrinology > Diabetes (1.00)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Decision Tree Learning (0.87)
Information Technology > Artificial Intelligence > Machine Learning > Ensemble Learning (0.63)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Regression (0.46)

Mostafa, Fahad, Khan, Hafiz

Functional Random Forest with Adaptive Cost-Sensitive Splitting for Imbalanced Functional Data Classification

arXiv.org Machine LearningDec-10-2025

Classification of functional data where observations are curves or trajectories poses unique challenges, particularly under severe class imbalance. Traditional Random Forest algorithms, while robust for tabular data, often fail to capture the intrinsic structure of functional observations and struggle with minority class detection. This paper introduces Functional Random Forest with Adaptive Cost-Sensitive Splitting (FRF-ACS), a novel ensemble framework designed for imbalanced functional data classification. The proposed method leverages basis expansions and Functional Principal Component Analysis (FPCA) to represent curves efficiently, enabling trees to operate on low dimensional functional features. To address imbalance, we incorporate a dynamic cost sensitive splitting criterion that adjusts class weights locally at each node, combined with a hybrid sampling strategy integrating functional SMOTE and weighted bootstrapping. Additionally, curve specific similarity metrics replace traditional Euclidean measures to preserve functional characteristics during leaf assignment. Extensive experiments on synthetic and real world datasets including biomedical signals and sensor trajectories demonstrate that FRF-ACS significantly improves minority class recall and overall predictive performance compared to existing functional classifiers and imbalance handling techniques. This work provides a scalable, interpretable solution for high dimensional functional data analysis in domains where minority class detection is critical.

classifier, dataset, imbalance, (16 more...)

2512.07888

Country:

North America > United States > Texas (0.04)
North America > United States > Arizona (0.04)

Genre:

Research Report > Experimental Study (0.67)
Research Report > New Finding (0.46)

Industry: Health & Medicine (1.00)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Performance Analysis > Accuracy (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Decision Tree Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Ensemble Learning (0.82)