AITopics | Ensemble Learning

Collaborating Authors

Ensemble Learning

Ensemble methods use multiple learning algorithms to obtain better predictive performance than could be obtained from any of the constituent learning algorithms alone. (Wikipedia)

News Overviews Instructional Materials AI-Alerts Classics

Practical considerations for variable screening in the Super Learner

Williamson, Brian D., King, Drew, Huang, Ying

arXiv.org Machine LearningNov-6-2023

Estimating a prediction function is a fundamental component of many data analyses. The Super Learner ensemble, a particular implementation of stacking, has desirable theoretical properties and has been used successfully in many applications. Dimension reduction can be accomplished by using variable screening algorithms, including the lasso, within the ensemble prior to fitting other prediction algorithms. However, the performance of a Super Learner using the lasso for dimension reduction has not been fully explored in cases where the lasso is known to perform poorly. We provide empirical results that suggest that a diverse set of candidate screening algorithms should be used to protect against poor performance of any one screen, similar to the guidance for choosing a library of prediction algorithms for the Super Learner.

artificial intelligence, machine learning, super learner, (16 more...)

arXiv.org Machine Learning

2311.03313

Country: North America > United States > Oklahoma > Payne County > Cushing (0.04)

Genre: Research Report (1.00)

Industry: Health & Medicine > Therapeutic Area > Immunology (0.69)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Ensemble Learning (0.50)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (0.47)

Add feedback

On Subagging Boosted Probit Model Trees

Qin, Tian, Huang, Wei-Min

arXiv.org Machine LearningNov-5-2023

With the insight of variance-bias decomposition, we design a new hybrid bagging-boosting algorithm named SBPMT for classification problems. For the boosting part of SBPMT, we propose a new tree model called Probit Model Tree (PMT) as base classifiers in AdaBoost procedure. For the bagging part, instead of subsampling from the dataset at each step of boosting, we perform boosted PMTs on each subagged dataset and combine them into a powerful "committee", which can be viewed an incomplete U-statistic. Our theoretical analysis shows that (1) SBPMT is consistent under certain assumptions, (2) Increase the subagging times can reduce the generalization error of SBPMT to some extent and (3) Large number of ProbitBoost iterations in PMT can benefit the performance of SBPMT with fewer steps in the AdaBoost part. Those three properties are verified by a famous simulation designed by Mease and Wyner (2008). The last two points also provide a useful guidance in model tuning. A comparison of performance with other state-of-the-art classification methods illustrates that the proposed SBPMT algorithm has competitive prediction power in general and performs significantly better in some cases.

classifier, probitboost, sbpmt, (15 more...)

arXiv.org Machine Learning

2311.02827

Country:

North America > United States > Pennsylvania > Northampton County > Bethlehem (0.04)
North America > United States > New York > New York County > New York City (0.04)
Europe > United Kingdom > England > Oxfordshire > Oxford (0.04)

Genre: Research Report > New Finding (0.46)

Industry: Health & Medicine (0.68)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Decision Tree Learning (0.90)
Information Technology > Artificial Intelligence > Machine Learning > Ensemble Learning (0.69)

Add feedback

Ransomware Detection and Classification using Machine Learning

Kunku, Kavitha, Zaman, ANK, Roy, Kaushik

arXiv.org Artificial IntelligenceNov-5-2023

Vicious assaults, malware, and various ransomware pose a cybersecurity threat, causing considerable damage to computer structures, servers, and mobile and web apps across various industries and businesses. These safety concerns are important and must be addressed immediately. Ransomware detection and classification are critical for guaranteeing rapid reaction and prevention. This study uses the XGBoost classifier and Random Forest (RF) algorithms to detect and classify ransomware attacks. This approach involves analyzing the behaviour of ransomware and extracting relevant features that can help distinguish between different ransomware families. The models are evaluated on a dataset of ransomware attacks and demonstrate their effectiveness in accurately detecting and classifying ransomware. The results show that the XGBoost classifier, Random Forest Classifiers, can effectively detect and classify different ransomware attacks with high accuracy, thereby providing a valuable tool for enhancing cybersecurity.

accuracy, algorithm, ransomware, (10 more...)

arXiv.org Artificial Intelligence

2311.16143

Country:

North America > Canada > Ontario > Waterloo Region > Waterloo (0.14)
North America > United States > North Carolina > Guilford County > Greensboro (0.04)
Asia > Middle East > UAE (0.04)

Genre: Research Report > New Finding (0.67)

Industry:

Information Technology > Security & Privacy (1.00)
Government > Military > Cyberwarfare (0.76)

Technology:

Information Technology > Security & Privacy (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Ensemble Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Performance Analysis > Accuracy (0.51)

Add feedback

Scalable Probabilistic Forecasting in Retail with Gradient Boosted Trees: A Practitioner's Approach

Long, Xueying, Bui, Quang, Oktavian, Grady, Schmidt, Daniel F., Bergmeir, Christoph, Godahewa, Rakshitha, Lee, Seong Per, Zhao, Kaifeng, Condylis, Paul

arXiv.org Artificial IntelligenceNov-2-2023

The recent M5 competition has advanced the state-of-the-art in retail forecasting. However, we notice important differences between the competition challenge and the challenges we face in a large e-commerce company. The datasets in our scenario are larger (hundreds of thousands of time series), and e-commerce can afford to have a larger assortment than brick-and-mortar retailers, leading to more intermittent data. To scale to larger dataset sizes with feasible computational effort, firstly, we investigate a two-layer hierarchy and propose a top-down approach to forecasting at an aggregated level with less amount of series and intermittency, and then disaggregating to obtain the decision-level forecasts. Probabilistic forecasts are generated under distributional assumptions. Secondly, direct training at the lower level with subsamples can also be an alternative way of scaling. Performance of modelling with subsets is evaluated with the main dataset. Apart from a proprietary dataset, the proposed scalable methods are evaluated using the Favorita dataset and the M5 dataset. We are able to show the differences in characteristics of the e-commerce and brick-and-mortar retail datasets. Notably, our top-down forecasting framework enters the top 50 of the original M5 competition, even with models trained at a higher level under a much simpler setting.

dataset, forecast, loss default 0, (14 more...)

arXiv.org Artificial Intelligence

2311.00993

Country:

North America > Trinidad and Tobago > Trinidad > Arima > Arima (0.04)
Asia > Singapore (0.04)
Asia > Indonesia (0.04)
(3 more...)

Genre: Research Report > New Finding (0.46)

Industry:

Retail (1.00)
Information Technology > Services (0.89)

Technology:

Information Technology > Modeling & Simulation (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (1.00)
(3 more...)

Add feedback

Modified Genetic Algorithm for Feature Selection and Hyper Parameter Optimization: Case of XGBoost in Spam Prediction

Ghatasheh, Nazeeh, Altaharwa, Ismail, Aldebei, Khaled

arXiv.org Artificial IntelligenceOct-30-2023

Recently, spam on online social networks has attracted attention in the research and business world. Twitter has become the preferred medium to spread spam content. Many research efforts attempted to encounter social networks spam. Twitter brought extra challenges represented by the feature space size, and imbalanced data distributions. Usually, the related research works focus on part of these main challenges or produce black-box models. In this paper, we propose a modified genetic algorithm for simultaneous dimensionality reduction and hyper parameter optimization over imbalanced datasets. The algorithm initialized an eXtreme Gradient Boosting classifier and reduced the features space of tweets dataset; to generate a spam prediction model. The model is validated using a 50 times repeated 10-fold stratified cross-validation, and analyzed using nonparametric statistical tests. The resulted prediction model attains on average 82.32\% and 92.67\% in terms of geometric mean and accuracy respectively, utilizing less than 10\% of the total feature space. The empirical results show that the modified genetic algorithm outperforms $Chi^2$ and $PCA$ feature selection methods. In addition, eXtreme Gradient Boosting outperforms many machine learning algorithms, including BERT-based deep learning model, in spam prediction. Furthermore, the proposed approach is applied to SMS spam modeling and compared to related works.

algorithm, feature selection, selection and hyper parameter optimization, (11 more...)

arXiv.org Artificial Intelligence

doi: 10.1109/ACCESS.2022.3196905

2310.19845

Country:

Asia > Middle East > Republic of Türkiye > Bingoel Province > Bingol (0.04)
Asia > Middle East > Jordan > Aqaba Governorate > Aqaba (0.04)
North America > United States > New York > New York County > New York City (0.04)
(8 more...)

Genre:

Research Report > New Finding (1.00)
Research Report > Experimental Study (1.00)

Industry:

Information Technology > Services (1.00)
Information Technology > Security & Privacy (1.00)
Health & Medicine (1.00)
Energy (1.00)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Optimization (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Performance Analysis > Accuracy (1.00)
(3 more...)

Add feedback

Machine Learning Algorithms to Predict Chess960 Result and Develop Opening Themes

Deo, Shreyan, Dwivedi, Nishchal

arXiv.org Artificial IntelligenceOct-29-2023

This work focuses on the analysis of Chess 960, also known as Fischer Random Chess, a variant of traditional chess where the starting positions of the pieces are randomized. The study aims to predict the game outcome using machine learning techniques and develop an opening theme for each starting position. The first part of the analysis utilizes machine learning models to predict the game result based on certain moves in each position. The methodology involves segregating raw data from .pgn files into usable formats and creating datasets comprising approximately 500 games for each starting position. Three machine learning algorithms -- KNN Clustering, Random Forest, and Gradient Boosted Trees -- have been used to predict the game outcome. To establish an opening theme, the board is divided into five regions: center, white kingside, white queenside, black kingside, and black queenside. The data from games played by top engines in all 960 positions is used to track the movement of pieces in the opening. By analysing the change in the number of pieces in each region at specific moves, the report predicts the region towards which the game is developing. These models provide valuable insights into predicting game outcomes and understanding the opening theme in Chess 960.

data set 1, data set 2, prediction, (10 more...)

arXiv.org Artificial Intelligence

2310.18938

Country:

Asia > India > NCT > Delhi (0.04)
Asia > India > Maharashtra > Mumbai (0.04)

Genre: Research Report > Experimental Study (0.34)

Industry: Leisure & Entertainment > Games > Chess (1.00)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Ensemble Learning (0.70)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (0.68)

Add feedback

Enhanced Gradient Boosting for Zero-Inflated Insurance Claims and Comparative Analysis of CatBoost, XGBoost, and LightGBM

So, Banghee

arXiv.org Artificial IntelligenceOct-28-2023

The property and casualty (P&C) insurance industry faces challenges in developing claim predictive models due to the highly right-skewed distribution of positive claims with excess zeros. To address this, actuarial science researchers have employed "zero-inflated" models that combine a traditional count model and a binary model. This paper investigates the use of boosting algorithms to process insurance claim data, including zero-inflated telematics data, to construct claim frequency models. Three popular gradient boosting libraries - XGBoost, LightGBM, and CatBoost - are evaluated and compared to determine the most suitable library for training insurance claim data and fitting actuarial frequency models. Through a comprehensive analysis of two distinct datasets, it is determined that CatBoost is the best for developing auto claim frequency models based on predictive performance. Furthermore, we propose a new zero-inflated Poisson boosted tree model, with variation in the assumption about the relationship between inflation probability $p$ and distribution mean $\mu$, and find that it outperforms others depending on data characteristics. This model enables us to take advantage of particular CatBoost tools, which makes it easier and more convenient to investigate the effects and interactions of various risk features on the frequency model when using telematics data.

dataset, gradient, loss function, (15 more...)

arXiv.org Artificial Intelligence

2307.07771

Country:

North America > United States > New York (0.04)
Europe > France (0.04)

Genre:

Research Report > Experimental Study (0.46)
Research Report > New Finding (0.46)

Industry: Banking & Finance > Insurance (1.00)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Ensemble Learning (1.00)

Add feedback

Stability of Random Forests and Coverage of Random-Forest Prediction Intervals

Wang, Yan, Wu, Huaiqing, Nettleton, Dan

arXiv.org Machine LearningOct-28-2023

We establish stability of random forests under the mild condition that the squared response ($Y^2$) does not have a heavy tail. In particular, our analysis holds for the practical version of random forests that is implemented in popular packages like \texttt{randomForest} in \texttt{R}. Empirical results show that stability may persist even beyond our assumption and hold for heavy-tailed $Y^2$. Using the stability property, we prove a non-asymptotic lower bound for the coverage probability of prediction intervals constructed from the out-of-bag error of random forests. With another mild condition that is typically satisfied when $Y$ is continuous, we also establish a complementary upper bound, which can be similarly established for the jackknife prediction interval constructed from an arbitrary stable algorithm. We also discuss the asymptotic coverage probability under assumptions weaker than those considered in previous literature. Our work implies that random forests, with its stability property, is an effective machine learning method that can provide not only satisfactory point prediction but also justified interval prediction at almost no extra computational cost.

artificial intelligence, machine learning, stability, (18 more...)

arXiv.org Machine Learning

2310.18814

Country:

South America > Paraguay > Asunción > Asunción (0.04)
North America > United States > Michigan > Wayne County > Detroit (0.04)
North America > United States > Iowa > Story County > Ames (0.04)
Europe > United Kingdom > England > Oxfordshire > Oxford (0.04)

Genre: Research Report > New Finding (0.48)

Industry: Government (0.46)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Ensemble Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Decision Tree Learning (1.00)

Add feedback

End-to-end Feature Selection Approach for Learning Skinny Trees

Ibrahim, Shibal, Behdin, Kayhan, Mazumder, Rahul

arXiv.org Artificial IntelligenceOct-27-2023

Joint feature selection and tree ensemble learning is a challenging task. Popular tree ensemble toolkits e.g., Gradient Boosted Trees and Random Forests support feature selection post-training based on feature importances, which are known to be misleading, and can significantly hurt performance. We propose Skinny Trees: a toolkit for feature selection in tree ensembles, such that feature selection and tree ensemble learning occurs simultaneously. It is based on an end-to-end optimization approach that considers feature selection in differentiable trees with Group $\ell_0 - \ell_2$ regularization. We optimize with a first-order proximal method and present convergence guarantees for a non-convex and non-smooth objective. Interestingly, dense-to-sparse regularization scheduling can lead to more expressive and sparser tree ensembles than vanilla proximal method. On 15 synthetic and real-world datasets, Skinny Trees can achieve $1.5\times$ - $620\times$ feature compression rates, leading up to $10\times$ faster inference over dense trees, without any loss in performance. Skinny Trees lead to superior feature selection than many existing toolkits e.g., in terms of AUC performance for $25\%$ feature budget, Skinny Trees outperforms LightGBM by $10.2\%$ (up to $37.7\%$), and Random Forests by $3\%$ (up to $12.5\%$).

feature selection, selection, skinny tree, (15 more...)

arXiv.org Artificial Intelligence

2310.18542

Country:

North America > United States > Massachusetts > Middlesex County > Cambridge (0.14)
North America > United States > New York > New York County > New York City (0.04)
Asia > Middle East > Jordan (0.04)
(2 more...)

Genre: Research Report (1.00)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Ensemble Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Decision Tree Learning (1.00)

Add feedback

Improving the Timing Resolution of Positron Emission Tomography Detectors Using Boosted Learning -- A Residual Physics Approach

Naunheim, Stephan, Kuhl, Yannick, Schug, David, Schulz, Volkmar, Mueller, Florian

arXiv.org Artificial IntelligenceOct-26-2023

Artificial intelligence (AI) is entering medical imaging, mainly enhancing image reconstruction. Nevertheless, improvements throughout the entire processing, from signal detection to computation, potentially offer significant benefits. This work presents a novel and versatile approach to detector optimization using machine learning (ML) and residual physics. We apply the concept to positron emission tomography (PET), intending to improve the coincidence time resolution (CTR). PET visualizes metabolic processes in the body by detecting photons with scintillation detectors. Improved CTR performance offers the advantage of reducing radioactive dose exposure for patients. Modern PET detectors with sophisticated concepts and read-out topologies represent complex physical and electronic systems requiring dedicated calibration techniques. Traditional methods primarily depend on analytical formulations successfully describing the main detector characteristics. However, when accounting for higher-order effects, additional complexities arise matching theoretical models to experimental reality. Our work addresses this challenge by combining traditional calibration with AI and residual physics, presenting a highly promising approach. We present a residual physics-based strategy using gradient tree boosting and physics-guided data generation. The explainable AI framework SHapley Additive exPlanations (SHAP) was used to identify known physical effects with learned patterns. In addition, the models were tested against basic physical laws. We were able to improve the CTR significantly (more than 20%) for clinically relevant detectors of 19 mm height, reaching CTRs of 185 ps (450-550 keV).

detector, doi, tnnl, (16 more...)

arXiv.org Artificial Intelligence

doi: 10.1109/TNNLS.2023.3323131

2302.01681

Country:

North America > United States > New York > New York County > New York City (0.04)
Europe > Germany > North Rhine-Westphalia > Cologne Region > Aachen (0.04)

Genre: Research Report (0.70)

Industry: Health & Medicine > Diagnostic Medicine > Imaging (1.00)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Ensemble Learning (0.66)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (0.66)

Add feedback