Ensemble Learning
Random Planted Forest: a directly interpretable tree ensemble
Hiabu, Munir, Mammen, Enno, Meyer, Joseph T.
We introduce a novel interpretable and tree-based algorithm for prediction in a regression setting in which each tree in a classical random forest is replaced by a family of planted trees that grow simultaneously. The motivation for our algorithm is to estimate the unknown regression function from a functional ANOVA decomposition perspective, where each tree corresponds to a function within that decomposition. Therefore, planted trees are limited in the number of interaction terms. The maximal order of approximation in the ANOVA decomposition can be specified or left unlimited. If a first order approximation is chosen, the result is an additive model. In the other extreme case, if the order of approximation is not limited, the resulting model puts no restrictions on the form of the regression function. In a simulation study we find encouraging prediction and visualisation properties of our random planted forest method. We also develop theory for an idealised version of random planted forests in the case of an underlying additive model. We show that in the additive case, the idealised version achieves, up to a logarithmic factor, asymptotically optimal one-dimensional convergence rates of order $n^{-2/5}$.
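For orientation, the functional ANOVA decomposition referred to in the abstract above is conventionally written as a constant plus main effects plus interaction terms of increasing order. The notation here ($m$ for the regression function, $d$ for the number of features, and $m_0$, $m_k$, $m_{kl}$ for the components) is an assumption of this sketch and is not taken from the paper:

```latex
m(x) = m_0 + \sum_{k=1}^{d} m_k(x_k) + \sum_{k<l} m_{kl}(x_k, x_l) + \dots + m_{1 \cdots d}(x_1, \dots, x_d)
```

Truncating after the first sum gives the additive model mentioned in the abstract, $m(x) = m_0 + \sum_{k=1}^{d} m_k(x_k)$.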
Effective Email Spam Detection System using Extreme Gradient Boosting
Mustapha, Ismail B., Hasan, Shafaatunnur, Olatunji, Sunday O., Shamsuddin, Siti Mariyam, Kazeem, Afolabi
The popularity, cost-effectiveness and ease of information exchange that electronic mail offers to electronic device users have been plagued by the rising number of unsolicited or spam emails. Driven by the need to protect email users from this growing menace, research in spam email filtering/detection systems has been increasingly active in the last decade. However, the adaptive nature of spam emails has often rendered most of these systems ineffective. While several spam detection models have been reported in the literature, the reported performance on out-of-sample test data shows room for improvement. Presented in this research is an improved spam detection model based on Extreme Gradient Boosting (XGBoost), which to the best of our knowledge has received little attention in spam email detection problems. Experimental results show that the proposed model outperforms earlier approaches across a wide range of evaluation metrics. A thorough analysis of the model results in comparison to the results of earlier works is also presented.
The COVID-19 pandemic: socioeconomic and health disparities
Disadvantaged groups around the world have suffered higher mortality during the current COVID-19 pandemic. This disparity suggests that socioeconomic and health-related factors may drive inequality in disease outcome. To identify the factors correlated with COVID-19 outcome, country aggregate data provided by the Lancet COVID-19 Commission were subjected to correlation analysis. Socioeconomic and health-related variables were used to predict mortality in the top 5 most affected countries using ridge regression and extreme gradient boosting (XGBoost) models. Our data reveal that predictors related to demographics and social disadvantage correlate with COVID-19 mortality per million and that XGBoost performed better than ridge regression. Taken together, our findings suggest that the health consequences of the current pandemic are not confined to the indiscriminate impact of a viral infection, but that these preventable effects are amplified by pre-existing health and socioeconomic inequalities.
(Decision and regression) tree ensemble based kernels for regression and classification
Feng, Dai, Baumgartner, Richard
Tree-based ensembles such as Breiman's random forest (RF) and Gradient Boosted Trees (GBT) can be interpreted as implicit kernel generators, where the ensuing proximity matrix represents the data-driven tree ensemble kernel. The kernel perspective on the RF has been used to develop a principled framework for theoretical investigation of its statistical properties. Recently, it has been shown that the kernel interpretation is germane to other tree-based ensembles, e.g. GBTs. However, the practical utility of the links between kernels and tree ensembles has not been widely explored or systematically evaluated. The focus of our work is the investigation of the interplay between kernel methods and tree-based ensembles, including the RF and GBT. We elucidate the performance and properties of the RF and GBT based kernels in a comprehensive simulation study comprising continuous and binary targets. We show that for continuous targets, the RF/GBT kernels are competitive with their respective ensembles in higher dimensional scenarios, particularly in cases with a larger number of noisy features. For the binary target, the RF/GBT kernels and their respective ensembles exhibit comparable performance. We provide results from real-life data sets for regression and classification to show how these insights may be leveraged in practice. Overall, our results support the tree ensemble based kernels as a valuable addition to the practitioner's toolbox. Finally, we discuss extensions of the tree ensemble based kernels for survival targets, interpretable prototype and landmarking classification and regression. We outline a future line of research for kernels furnished by Bayesian counterparts of the frequentist tree ensembles.
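One common way to read a tree ensemble as a kernel is the proximity matrix: the kernel value for two samples is the fraction of trees in which they fall into the same leaf. The sketch below assumes the leaf indices are already available (with scikit-learn they could come from a fitted forest's `apply` method). The function name `proximity_kernel` and the toy leaf assignments are hypothetical, and the paper may use a different kernel construction.

```python
def proximity_kernel(leaves):
    """Proximity kernel from leaf assignments.

    leaves[i][t] is the index of the leaf that sample i reaches in tree t.
    K[i][j] is the fraction of trees in which samples i and j share a leaf.
    """
    n = len(leaves)
    n_trees = len(leaves[0])
    kernel = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i, n):
            shared = sum(1 for a, b in zip(leaves[i], leaves[j]) if a == b)
            kernel[i][j] = kernel[j][i] = shared / n_trees
    return kernel


# Hypothetical leaf assignments for 4 samples in a 3-tree ensemble.
leaves = [
    [0, 2, 1],
    [0, 2, 3],
    [1, 2, 3],
    [4, 5, 6],
]
K = proximity_kernel(leaves)
```

The resulting matrix is symmetric with ones on the diagonal, so it can be passed to any method that accepts a precomputed kernel.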
Automatic detection of abnormal EEG signals using wavelet feature extraction and gradient boosting decision tree
Albaqami, Hezam, Hassan, Ghulam Mubashar, Subasi, Abdulhamit, Datta, Amitava
Electroencephalography (EEG) is frequently used for diagnostic evaluation of various brain-related disorders due to its excellent resolution, non-invasive nature and low cost. However, manual analysis of EEG signals can be a strenuous and time-consuming process for experts. It requires a long training time for physicians to develop expertise, and experts additionally have low inter-rater agreement (IRA) among themselves. Therefore, many Computer Aided Diagnostic (CAD) based studies have considered automating the interpretation of EEG signals to alleviate the workload and support the final diagnosis. In this paper, we present an automatic binary classification framework for brain signals in multichannel EEG recordings. We propose to use Wavelet Packet Decomposition (WPD) techniques to decompose the EEG signals into frequency sub-bands and extract a set of statistical features from each of the selected coefficients. Moreover, we propose a novel method to reduce the dimension of the feature space without compromising the quality of the extracted features. The extracted features are classified using different Gradient Boosting Decision Tree (GBDT) based classification frameworks, namely CatBoost, XGBoost and LightGBM. We used the Temple University Hospital EEG Abnormal Corpus V2.0.0 to test our proposed technique. We found that the CatBoost classifier achieves a binary classification accuracy of 87.68%, and outperforms state-of-the-art techniques on the same dataset by more than 1% in accuracy and more than 3% in sensitivity. The results obtained in this research provide important insights into the usefulness of WPD feature extraction and GBDT classifiers for EEG classification.
Implementing the AdaBoost Algorithm From Scratch - KDnuggets
Boosting is an ensemble technique that attempts to create a strong classifier from a number of weak classifiers. Unlike many machine learning models, which focus on high-quality prediction from a single model, boosting algorithms seek to improve predictive power by training a sequence of weak models, each compensating for the weaknesses of its predecessors. Boosting thereby lets machine learning models improve their prediction accuracy. AdaBoost, short for Adaptive Boosting, is a machine learning algorithm formulated by Yoav Freund and Robert Schapire. AdaBoost is most commonly used with decision trees of depth one (decision stumps) as its weak learners.
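The idea described above can be sketched as a minimal discrete AdaBoost with decision stumps, in plain Python. This is an illustration under stated assumptions, not the article's own code: labels are taken to be -1 or +1, the stump search is brute force, and the function names and toy data are hypothetical.

```python
import math


def stump_predict(row, feature, threshold, polarity):
    """Depth-one tree: predict `polarity` on one side of the threshold, the opposite on the other."""
    return polarity if row[feature] <= threshold else -polarity


def fit_stump(X, y, w):
    """Brute-force search for the stump with the lowest weighted error."""
    best = None
    for feature in range(len(X[0])):
        for threshold in sorted({row[feature] for row in X}):
            for polarity in (1, -1):
                err = sum(wi for row, yi, wi in zip(X, y, w)
                          if stump_predict(row, feature, threshold, polarity) != yi)
                if best is None or err < best[0]:
                    best = (err, feature, threshold, polarity)
    return best


def adaboost_fit(X, y, n_rounds=3):
    n = len(X)
    w = [1.0 / n] * n  # start with uniform sample weights
    model = []
    for _ in range(n_rounds):
        err, feature, threshold, polarity = fit_stump(X, y, w)
        err = min(max(err, 1e-10), 1 - 1e-10)  # guard against log(0)
        alpha = 0.5 * math.log((1 - err) / err)  # vote weight of this stump
        model.append((alpha, feature, threshold, polarity))
        # Up-weight misclassified samples, down-weight the rest, then renormalise.
        w = [wi * math.exp(-alpha * yi * stump_predict(row, feature, threshold, polarity))
             for row, yi, wi in zip(X, y, w)]
        total = sum(w)
        w = [wi / total for wi in w]
    return model


def adaboost_predict(model, row):
    """Sign of the alpha-weighted vote over all stumps."""
    score = sum(alpha * stump_predict(row, f, t, p) for alpha, f, t, p in model)
    return 1 if score >= 0 else -1


# Toy 1-D data that no single stump can separate: +1 at both ends, -1 in the middle.
X = [[1], [2], [3], [4], [5], [6]]
y = [1, 1, -1, -1, 1, 1]
model = adaboost_fit(X, y, n_rounds=3)
predictions = [adaboost_predict(model, row) for row in X]
```

On this toy set a single stump misclassifies at least two points, while the weighted vote of three stumps reproduces all six labels.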
Understanding XGBoost Algorithm
XGBoost stands for "Extreme Gradient Boosting". XGBoost is an optimized distributed gradient boosting library designed to be highly efficient, flexible and portable. It implements machine learning algorithms under the gradient boosting framework. It provides parallel tree boosting to solve many data science problems in a fast and accurate way. Boosting is an ensemble learning technique that builds a strong classifier from several weak classifiers in series. Boosting algorithms play a crucial role in managing the bias-variance trade-off.
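To make "weak learners in series" concrete, here is a minimal sketch of plain gradient boosting for squared-error regression with one-split stumps on a single feature. It is not XGBoost's algorithm, which adds a regularised objective, second-order information and many systems-level optimisations, and it does not use the XGBoost API. All names and the toy data are hypothetical.

```python
def fit_regression_stump(x, residuals):
    """Find the single split on 1-D inputs that minimises squared error."""
    best = None
    for threshold in sorted(set(x))[:-1]:  # drop the largest value so the right side is non-empty
        left = [r for xi, r in zip(x, residuals) if xi <= threshold]
        right = [r for xi, r in zip(x, residuals) if xi > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        sse = (sum((r - left_mean) ** 2 for r in left)
               + sum((r - right_mean) ** 2 for r in right))
        if best is None or sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    return best[1:]


def gradient_boost_fit(x, y, n_rounds=50, learning_rate=0.1):
    base = sum(y) / len(y)  # initial prediction: the mean target
    pred = [base] * len(y)
    stumps = []
    for _ in range(n_rounds):
        # For squared error, the negative gradient is simply the residual.
        residuals = [yi - pi for yi, pi in zip(y, pred)]
        threshold, left_mean, right_mean = fit_regression_stump(x, residuals)
        stumps.append((threshold, left_mean, right_mean))
        pred = [pi + learning_rate * (left_mean if xi <= threshold else right_mean)
                for xi, pi in zip(x, pred)]
    return base, learning_rate, stumps


def gradient_boost_predict(model, xi):
    base, learning_rate, stumps = model
    return base + learning_rate * sum(l if xi <= t else r for t, l, r in stumps)


# Toy data: a quadratic curve that a single stump fits poorly.
x = [float(i) for i in range(10)]
y = [xi ** 2 for xi in x]
model = gradient_boost_fit(x, y)
```

Each round fits a stump to what the current ensemble still gets wrong and adds a shrunken copy of it. The shrinkage factor (`learning_rate`) is one of the knobs that trades bias against variance.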
Boost Up With XGboost
There are lots of articles out there talking about XGBoost and using it for models. And why shouldn't there be? It is a really powerful tool that has been proven to obtain great results in a wide variety of environments, favoring heterogeneous data. It has implementations in several languages, but in this article we are going to follow the trend of the previous ones and see the Python 3 implementation. The fancy name of the library comes from the algorithm used in it to train the model, but how does it work? Let's go backwards seeing what each word means.
Impact of weather factors on migration intention using machine learning algorithms
Aoga, John, Bae, Juhee, Veljanoska, Stefanija, Nijssen, Siegfried, Schaus, Pierre
Growing attention in the empirical literature has been paid to the incidence of climate shocks and climate change on migration decisions. Previous literature reaches differing results and uses a multitude of traditional empirical approaches. This paper proposes a tree-based Machine Learning (ML) approach to analyze the role of weather shocks in an individual's intention to migrate in six agriculture-dependent economies: Burkina Faso, Ivory Coast, Mali, Mauritania, Niger, and Senegal. We apply several tree-based algorithms (e.g., XGB, Random Forest) using the train-validation-test workflow to build robust and noise-resistant approaches. Then we determine the important features and show in which direction they influence migration intention. This ML-based estimation accounts for features such as weather shocks captured by the Standardized Precipitation-Evapotranspiration Index (SPEI) for different timescales and various socioeconomic features/covariates. We find that (i) weather features improve the prediction performance, although socioeconomic characteristics have more influence on migration intentions, (ii) country-specific models are necessary, and (iii) international moves are influenced more by the longer SPEI timescales, while general moves (which include internal moves) are influenced more by the shorter timescales.
Guide To Ensemble Methods: Bagging vs Boosting
Building a highly accurate prediction model is certainly a difficult task. Noise is the irreducible error, i.e. the part of the target value that the model is not able to predict or explain. Since it is impossible to reduce the noise (hence the term irreducible error), we shift our focus to reducing bias and variance. Ensemble learning methods reduce the bias and variance of a model by using multiple models together (hence the term ensemble), rather than a single model, in order to achieve better predictive performance. There are a number of ensemble methods; in this article I will discuss two widely used ones, Bagging and Boosting. When we use different learning algorithms, or a single learning algorithm multiple times, for prediction.
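Bagging, one of the two methods named above, can be sketched in a few lines: draw bootstrap resamples of the training set, fit a high-variance base learner on each, and average their predictions. The example below is a minimal illustration and is not taken from the article. It assumes a 1-nearest-neighbour regressor as the base learner, and the function names and toy data are hypothetical.

```python
import random


def bootstrap_sample(data, rng):
    """Draw len(data) points from data with replacement."""
    return [rng.choice(data) for _ in data]


def one_nn_predict(sample, x):
    """High-variance base learner: return the target of the nearest training point."""
    return min(sample, key=lambda point: abs(point[0] - x))[1]


def bagged_predict(data, x, n_models=25, seed=0):
    """Average the base learner's predictions over bootstrap resamples."""
    rng = random.Random(seed)
    predictions = [one_nn_predict(bootstrap_sample(data, rng), x)
                   for _ in range(n_models)]
    return sum(predictions) / n_models


# Toy (x, y) pairs lying roughly on the line y = x.
data = [(0.0, 0.1), (1.0, 0.9), (2.0, 2.2), (3.0, 2.8), (4.0, 4.1)]
single = one_nn_predict(data, 2.4)   # copies the target of one training point
bagged = bagged_predict(data, 2.4)   # averages over 25 resampled learners
```

Averaging over resamples mainly reduces variance. Boosting instead trains its models in sequence, each one focusing on the errors of the ensemble so far.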