AITopics | Ensemble Learning

Collaborating Authors

Ensemble Learning

Ensemble methods use multiple learning algorithms to obtain better predictive performance than could be obtained from any of the constituent learning algorithms alone. (Wikipedia)

News Overviews Instructional Materials AI-Alerts Classics

Cost efficient gradient boosting

Sven Peter, Ferran Diego, Fred A. Hamprecht, Boaz Nadler

Neural Information Processing SystemsNov-21-2025, 08:51:47 GMT

Neural Information Processing Systems http://nips.cc/

algorithm, cost penalty, gradient, (16 more...)

Neural Information Processing Systems

Country:

Europe > Germany > Baden-Württemberg > Karlsruhe Region > Heidelberg (0.04)
Oceania > New Zealand > North Island > Waikato (0.04)
North America > United States > New York > New York County > New York City (0.04)
(4 more...)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Ensemble Learning (0.85)
Information Technology > Artificial Intelligence > Machine Learning > Decision Tree Learning (0.72)

Add feedback

Boosting Predictive Performance on Tabular Data through Data Augmentation with Latent-Space Flow-Based Diffusion

Ihsan, Md. Tawfique, Rafi, Md. Rakibul Hasan, Raihan, Ahmed Shoyeb, Ahmed, Imtiaz, Azeem, Abdullahil

arXiv.org Artificial IntelligenceNov-21-2025

Severe class imbalance is common in real-world tabular learning, where rare but important minority classes are essential for reliable prediction. Existing generative oversampling methods such as GANs, VAEs, and diffusion models can improve minority-class performance, but they often struggle with tabular heterogeneity, training stability, and privacy concerns. We propose a family of latent-space, tree-driven diffusion methods for minority oversampling that use conditional flow matching with gradient-boosted trees as the vector-field learner. The models operate in compact latent spaces to preserve tabular structure and reduce computation. We introduce three variants: PCAForest, which uses linear PCA embedding; EmbedForest, which uses a learned nonlinear embedding; and AttentionForest, which uses an attention-augmented embedding. Each method couples a GBT-based flow with a decoder back to the original feature space. Across 11 datasets from healthcare, finance, and manufacturing, AttentionForest achieves the best average minority recall while maintaining competitive precision, calibration, and distributional similarity. PCAForest and EmbedForest reach similar utility with much faster generation, offering favorable accuracy-efficiency trade-offs. Privacy evaluated with nearest-neighbor distance ratio and distance-to-closest-record is comparable to or better than the ForestDiffusion baseline. Ablation studies show that smaller embeddings tend to improve minority recall, while aggressive learning rates harm stability. Overall, latent-space, tree-driven diffusion provides an efficient and privacy-aware approach to high-fidelity tabular data augmentation under severe class imbalance.

artificial intelligence, deep learning, machine learning, (17 more...)

arXiv.org Artificial Intelligence

2511.16571

Country: Asia (0.27)

Genre:

Research Report (1.00)
Overview (0.92)

Industry:

Information Technology > Security & Privacy (1.00)
Health & Medicine (1.00)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Performance Analysis > Accuracy (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.93)
Information Technology > Artificial Intelligence > Machine Learning > Ensemble Learning (0.89)

Add feedback

Correlation-Aware Feature Attribution Based Explainable AI

Sengupta, Poushali, Zhang, Yan, Eliassen, Frank, Maharjan, Sabita

arXiv.org Machine LearningNov-21-2025

Explainable AI (XAI) is increasingly essential as modern models become more complex and high-stakes applications demand transparency, trust, and regulatory compliance. Existing global attribution methods often incur high computational costs, lack stability under correlated inputs, and fail to scale efficiently to large or heterogeneous datasets. We address these gaps with \emph{ExCIR} (Explainability through Correlation Impact Ratio), a correlation-aware attribution score equipped with a lightweight transfer protocol that reproduces full-model rankings using only a fraction of the data. ExCIR quantifies sign-aligned co-movement between features and model outputs after \emph{robust centering} (subtracting a robust location estimate, e.g., median or mid-mean, from features and outputs). We further introduce \textsc{BlockCIR}, a \emph{groupwise} extension of ExCIR that scores \emph{sets} of correlated features as a single unit. By aggregating the same signed-co-movement numerators and magnitudes over predefined or data-driven groups, \textsc{BlockCIR} mitigates double-counting in collinear clusters (e.g., synonyms or duplicated sensors) and yields smoother, more stable rankings when strong dependencies are present. Across diverse text, tabular, signal, and image datasets, ExCIR shows trustworthy agreement with established global baselines and the full model, delivers consistent top-$k$ rankings across settings, and reduces runtime via lightweight evaluation on a subset of rows. Overall, ExCIR provides \emph{computationally efficient}, \emph{consistent}, and \emph{scalable} explainability for real-world deployment.

cir, machine learning, natural language, (20 more...)

arXiv.org Machine Learning

2511.16482

Country: North America > United States (0.29)

Genre: Research Report (1.00)

Industry: Health & Medicine (1.00)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (1.00)
Information Technology > Artificial Intelligence > Natural Language (0.70)
Information Technology > Artificial Intelligence > Machine Learning > Ensemble Learning (0.68)
Information Technology > Artificial Intelligence > Representation & Reasoning (0.68)

Add feedback

Multi-Layered Gradient Boosting Decision Trees

Neural Information Processing SystemsNov-20-2025, 22:03:01 GMT

Multi-layered distributed representation is believed to be the key ingredient of deep neural networks especially in cognitive tasks like computer vision. While non-differentiable models such as gradient boosting decision trees (GBDTs) are still the dominant methods for modeling discrete or tabular data, they are hard to incorporate with such representation learning ability. In this work, we propose the multi-layered GBDT forest (mGBDTs), with an explicit emphasis on exploring the ability to learn hierarchical distributed representations by stacking several layers of regression GBDTs as its building block. The model can be jointly trained by a variant of target propagation across layers, without the need to derive backpropagation nor differentiability. Experiments confirmed the effectiveness of the model in terms of performance and representation learning ability.

decision tree, name change, representation, (2 more...)

Neural Information Processing Systems

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning > Diagnosis (0.68)
Information Technology > Artificial Intelligence > Machine Learning > Ensemble Learning (0.68)
Information Technology > Artificial Intelligence > Machine Learning > Decision Tree Learning (0.68)

Add feedback

The Impact of Bootstrap Sampling Rate on Random Forest Performance in Regression Tasks

Iwaniuk, Michał, Jarosz, Mateusz, Borycki, Bartłomiej, Jezierski, Bartosz, Cwalina, Jan, Kaźmierczak, Stanisław, Mańdziuk, Jacek

arXiv.org Artificial IntelligenceNov-19-2025

Abstract--Random Forests (RFs) typically train each tree on a bootstrap sample of the same size as the training set, i.e., bootstrap rate (BR) equals 1.0. We systematically examine how varying BR from 0.2 to 5.0 affects RF performance across 39 heterogeneous regression datasets and 16 RF configurations, evaluating with repeated two-fold cross-validation and mean squared error . Our results demonstrate that tuning the BR can yield significant improvements over the default: the best setup relied on BR 1.0 for 24 datasets, BR > 1.0 for 15, and BR = 1.0 was optimal in 4 cases only. We establish a link between dataset characteristics and the preferred BR: datasets with strong global feature-target relationships favor higher BRs, while those with higher local target variance benefit from lower BRs. T o further investigate this relationship, we conducted experiments on synthetic datasets with controlled noise levels. These experiments reproduce the observed bias-variance trade-off: in low-noise scenarios, higher BRs effectively reduce model bias, whereas in high-noise settings, lower BRs help reduce model variance. Overall, BR is an influential hyperparameter that should be tuned to optimize RF regression models. ANDOM Forest (RF) is an ensemble machine learning (ML) algorithm involving a set of decision trees that collectively make a decision. In classification tasks, each tree votes for a particular class, and the predicted label is determined either by hard voting (majority vote) or soft voting (averaged class probabilities across the trees). In regression tasks, the final prediction is the mean of all individual tree outputs. RFs serve as a robust baseline across a wide range of ML problems, offering an effective balance of predictive accuracy, training speed, and moderate interpretability. While gradient-boosted trees or deep neural networks may outperform them in heavily tuned or domain-specific settings, RF models consistently deliver near-optimal results with minimal tuning, especially on structured, tabular datasets [1], [2].

artificial intelligence, decision tree learning, machine learning, (18 more...)

arXiv.org Artificial Intelligence

2511.13952

Country: Europe (0.67)

Genre: Research Report > New Finding (1.00)

Industry:

Transportation (0.67)
Consumer Products & Services > Food, Beverage, Tobacco & Cannabis (0.46)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Ensemble Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Decision Tree Learning (1.00)
(2 more...)

Add feedback

Empirical Likelihood for Random Forests and Ensembles

Chiang, Harold D., Matsushita, Yukitoshi, Otsu, Taisuke

arXiv.org Machine LearningNov-19-2025

We develop an empirical likelihood (EL) framework for random forests and related ensemble methods, providing a likelihood-based approach to quantify their statistical uncertainty. Exploiting the incomplete $U$-statistic structure inherent in ensemble predictions, we construct an EL statistic that is asymptotically chi-squared when subsampling induced by incompleteness is not overly sparse. Under sparser subsampling regimes, the EL statistic tends to over-cover due to loss of pivotality; we therefore propose a modified EL that restores pivotality through a simple adjustment. Our method retains key properties of EL while remaining computationally efficient. Theory for honest random forests and simulations demonstrate that modified EL achieves accurate coverage and practical reliability relative to existing inference methods.

artificial intelligence, machine learning, random forest, (14 more...)

arXiv.org Machine Learning

2511.13934

Country: North America > United States > Wisconsin (0.28)

Genre: Research Report (0.64)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Decision Tree Learning (0.95)
Information Technology > Artificial Intelligence > Machine Learning > Ensemble Learning (0.85)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (0.66)

Add feedback

Asymptotic confidence bands for centered purely random forests

Neumeyer, Natalie, Rabe, Jan, Trabs, Mathias

arXiv.org Machine LearningNov-18-2025

In a multivariate nonparametric regression setting we construct explicit asymptotic uniform confidence bands for centered purely random forests. Since the most popular example in this class of random forests, namely the uniformly centered purely random forests, is well known to suffer from suboptimal rates, we propose a new type of purely random forests, called the Ehrenfest centered purely random forests, which achieve minimax optimal rates. Our main confidence band theorem applies to both random forests. The proof is based on an interpretation of random forests as generalized U-Statistics together with a Gaussian approximation of the supremum of empirical processes. Our theoretical findings are illustrated in simulation examples.

artificial intelligence, decision tree learning, machine learning, (18 more...)

arXiv.org Machine Learning

2511.13199

Country:

North America > United States > New York (0.04)
North America > United States > California (0.04)
Europe > Germany > Baden-Württemberg > Karlsruhe Region > Karlsruhe (0.04)

Genre: Research Report (0.63)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Ensemble Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Decision Tree Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Regression (0.46)

Add feedback

MorphBoost: Self-Organizing Universal Gradient Boosting with Adaptive Tree Morphing

Kriuk, Boris

arXiv.org Artificial IntelligenceNov-18-2025

Traditional gradient boosting algorithms employ static tree structures with fixed splitting criteria that remain unchanged throughout training, limiting their ability to adapt to evolving gradient distributions and problem-specific characteristics across different learning stages. This work introduces MorphBoost, a new gradient boosting framework featuring self-organizing tree structures that dynamically morph their splitting behavior during training. The algorithm implements adaptive split functions that evolve based on accumulated gradient statistics and iteration-dependent learning pressures, enabling automatic adjustment to problem complexity. Key innovations include: (1) morphing split criterion combining gradient-based scores with information-theoretic metrics weighted by training progress; (2) automatic problem fingerprinting for intelligent parameter configuration across binary/multiclass/regression tasks; (3) vectorized tree prediction achieving significant computational speedups; (4) interaction-aware feature importance detecting multiplicative relationships; and (5) fast-mode optimization balancing speed and accuracy. Comprehensive benchmarking across 10 diverse datasets against competitive models (XGBoost, LightGBM, GradientBoosting, HistGradientBoosting, ensemble methods) demonstrates that MorphBoost achieves state-of-the-art performance, outperforming XGBoost by 0.84% on average. MorphBoost secured the overall winner position with 4/10 dataset wins (40% win rate) and 6/30 top-3 finishes (20%), while maintaining the lowest variance (σ=0.0948) and highest minimum accuracy across all models, revealing superior consistency and robustness. Performance analysis across difficulty levels shows competitive results on easy datasets while achieving notable improvements on advanced problems due to higher adaptation levels.

artificial intelligence, gradient, machine learning, (18 more...)

arXiv.org Artificial Intelligence

2511.13234

Genre: Research Report (0.82)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Ensemble Learning (1.00)

Add feedback

LinkedIn Profile Characteristics and Professional Success Indicators

Eneye, Tania-Amanda Fredrick, Malla, Ashlesha, Paudel, Pawan

arXiv.org Artificial IntelligenceNov-18-2025

This study explores the relationship between LinkedIn profile characteristics and professional success, focusing on the indicators of promotions, follower count, and career progression rate. By leveraging a dataset of over 62,000 anonymized LinkedIn profiles, we developed predictive models using machine learning techniques to identify the most influential factors driving professional success. Results indicate that while promotions are highly predictable, follower growth exhibits greater complexity. This research provides actionable insights for professionals seeking to optimize their LinkedIn presence and career strategies.

artificial intelligence, machine learning, prediction, (16 more...)

arXiv.org Artificial Intelligence

2511.12905

Country: North America > United States > Texas (0.15)

Genre: Research Report > Experimental Study (0.66)

Industry: Information Technology > Services (1.00)

Technology:

Information Technology > Communications > Social Media (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Ensemble Learning (0.34)

Add feedback

Probabilistic Wildfire Susceptibility from Remote Sensing Using Random Forests and SHAP

Cheerala, Udaya Bhasker, Chirukuri, Varun Teja, Gummadi, Venkata Akhil Kumar, Bhuyan, Jintu Moni, Damacharla, Praveen

arXiv.org Artificial IntelligenceNov-18-2025

Wildfires pose a significant global threat to ecosystems worldwide, with California experiencing recurring fires due to various factors, including climate, topographical features, vegetation patterns, and human activities. This study aims to develop a comprehensive wildfire risk map for California by applying the random forest (RF) algorithm, augmented with Explainable Artificial Intelligence (XAI) through Shapley Additive exPlanations (SHAP), to interpret model predictions. Model performance was assessed using both spatial and temporal validation strategies. The RF model demonstrated strong predictive performance, achieving near-perfect discrimination for grasslands (AUC = 0.996) and forests (AUC = 0.997). Spatial cross-validation revealed moderate transferability, yielding ROC-AUC values of 0.6155 for forests and 0.5416 for grasslands. In contrast, temporal split validation showed enhanced generalization, especially for forests (ROC-AUC = 0.6615, PR-AUC = 0.8423). SHAP-based XAI analysis identified key ecosystem-specific drivers: soil organic carbon, tree cover, and Normalized Difference Vegetation Index (NDVI) emerged as the most influential in forests, whereas Land Surface Temperature (LST), elevation, and vegetation health indices were dominant in grasslands. District-level classification revealed that Central Valley and Northern Buttes districts had the highest concentration of high-risk grasslands, while Northern Buttes and North Coast Redwoods dominated forested high-risk areas. This RF-SHAP framework offers a robust, comprehensible, and adaptable method for assessing wildfire risks, enabling informed decisions and creating targeted strategies to mitigate dangers.

artificial intelligence, grassland, machine learning, (14 more...)

arXiv.org Artificial Intelligence

2511.1168

Country: North America > United States > California (0.72)

Genre: Research Report (0.64)

Industry: Energy > Renewable > Geothermal > Geothermal Energy Exploration and Development > Geophysical Analysis & Survey (0.52)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Ensemble Learning (0.85)
Information Technology > Artificial Intelligence > Machine Learning > Performance Analysis > Accuracy (0.71)

Add feedback