 Ensemble Learning


A Journey through XGBoost: Milestone 1

#artificialintelligence

Welcome to another article series! This time, we are discussing XGBoost (Extreme Gradient Boosting), one of the leading and most preferred machine learning algorithms among data scientists in the 21st century. Many people call XGBoost a money-making algorithm because it often outperforms other algorithms, achieves top scores, and helps its users win cash prizes in data science competitions. The topic is broad and important, so we discuss it through a series of articles. It is like a journey, maybe a long one for newcomers.


Slow-Growing Trees

arXiv.org Machine Learning

Random Forest's performance can be matched by a single slow-growing tree (SGT), which uses a learning rate to tame CART's greedy algorithm. SGT exploits the view that CART is an extreme case of an iterative weighted least squares procedure. Moreover, a unifying view of Boosted Trees (BT) and Random Forests (RF) is presented. Greedy ML algorithms' outcomes can be improved using either "slow learning" or diversification. SGT applies the former to estimate a single deep tree, and Booging (bagging stochastic BT with a high learning rate) uses the latter with additive shallow trees. The performance of this tree ensemble quaternity (Booging, BT, SGT, RF) is assessed on simulated and real regression tasks.


Generalised Boosted Forests

arXiv.org Machine Learning

This paper extends recent work on boosting random forests to model non-Gaussian responses. Given an exponential family $\mathbb{E}[Y|X] = g^{-1}(f(X))$ our goal is to obtain an estimate for $f$. We start with an MLE-type estimate in the link space and then define generalised residuals from it. We use these residuals and some corresponding weights to fit a base random forest and then repeat the same to obtain a boost random forest. We call the sum of these three estimators a \textit{generalised boosted forest}. We show with simulated and real data that both random forest steps reduce test-set log-likelihood, which we treat as our primary metric. We also provide a variance estimator, which we can obtain with the same computational cost as the original estimate itself. Empirical experiments on real-world data and simulations demonstrate that the methods can effectively reduce bias, and that confidence interval coverage is conservative in the bulk of the covariate distribution.


The Glory of XGBoost

#artificialintelligence

There are so many machine learning algorithms out there; how do you choose the best one for your problem? The answer depends on the application and the data. Is it classification, regression, supervised, unsupervised, natural language processing, time series? There are many avenues to take, but in this article I am going to focus on one algorithm that I find particularly interesting: XGBoost. XGBoost stands for extreme gradient boosting and is an open source library that provides an efficient and effective implementation of gradient boosting.
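To make the term "gradient boosting" concrete, here is a minimal sketch in plain Python. It assumes a single numeric feature, squared-error loss, and one-split decision stumps as weak learners. All function names are hypothetical, and this is not how the XGBoost library itself is implemented; it only illustrates the idea of repeatedly fitting a weak learner to the current residuals and adding a shrunken copy of it to the ensemble.

```python
# Toy gradient boosting for squared-error regression on a single feature.
# Illustrative only: names are hypothetical and this is NOT XGBoost's code.

def fit_stump(xs, residuals):
    """Fit a one-split regression stump to the residuals by least squares."""
    overall = sum(residuals) / len(residuals)
    # Fallback when no split is possible: predict the mean on both sides.
    best = (float("inf"), xs[0], overall, overall)
    for t in sorted(set(xs))[:-1]:  # skip the max so both sides are non-empty
        left = [r for x, r in zip(xs, residuals) if x <= t]
        right = [r for x, r in zip(xs, residuals) if x > t]
        lmean = sum(left) / len(left)
        rmean = sum(right) / len(right)
        sse = (sum((r - lmean) ** 2 for r in left)
               + sum((r - rmean) ** 2 for r in right))
        if sse < best[0]:
            best = (sse, t, lmean, rmean)
    return best[1:]  # (threshold, left value, right value)

def stump_predict(stump, x):
    threshold, left_value, right_value = stump
    return left_value if x <= threshold else right_value

def fit_boosting(xs, ys, n_rounds=50, learning_rate=0.1):
    """Start from the mean, then add shrunken stumps fitted to residuals."""
    base = sum(ys) / len(ys)
    preds = [base] * len(ys)
    stumps = []
    for _ in range(n_rounds):
        # For squared error, the negative gradient is the residual y - prediction.
        residuals = [y - p for y, p in zip(ys, preds)]
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        preds = [p + learning_rate * stump_predict(stump, x)
                 for p, x in zip(preds, xs)]
    return base, stumps, learning_rate

def boosting_predict(model, x):
    base, stumps, learning_rate = model
    return base + learning_rate * sum(stump_predict(s, x) for s in stumps)

def training_mse(model, xs, ys):
    return sum((boosting_predict(model, x) - y) ** 2
               for x, y in zip(xs, ys)) / len(ys)
```

Calling `fit_boosting(xs, ys)` on a small dataset and comparing `training_mse` for zero rounds against fifty rounds shows the training error shrinking as stumps are added. Production libraries add regularisation, multi-feature trees, and many engineering optimisations on top of this basic loop.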


Machine Learning May Reduce Mental Health Misdiagnosis

#artificialintelligence

Depressive episodes in bipolar disorder can be indistinguishable from those in major depressive disorder, leading to misdiagnosis and poor subsequent outcomes. Approximately 40% of patients with bipolar disorder are initially diagnosed with major depressive disorder; average delay in bipolar diagnosis ranges from 5.7 to 7.5 years. In conjunction with data from self-reports and blood biomarker data, a machine learning algorithm called Extreme Gradient Boosting (XGBoost) was able to distinguish between bipolar disorder and major depressive disorder. The predictive capabilities of artificial intelligence (AI) can assist researchers and clinicians in disciplines characterized by complexity and nuance. AI machine learning is increasingly being used in life sciences, biotechnology, and mental health.


Accurate classification of COVID-19 patients with different severity via machine learning

#artificialintelligence

Infection with severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) can cause a dramatic response in coronavirus disease 2019 (COVID-19) patients at the multi-omics level [1-3], so it is essential to systematically assess the pathogenesis of COVID-19. In our previous study, we presented the first trans-omics landscape of 236 COVID-19 patients with 4 clinical severity groups (including asymptomatic, mild, severe and critically ill cases) and found that the mild and severe COVID-19 patients shared several similar characteristics [4]. However, it is crucial to discriminate mild from severe COVID-19 patients to prevent disease progression in the latter by facilitating early intervention. Herein, we developed an extreme gradient boosting (XGBoost) machine-learning model to predict COVID-19 severity by leveraging multi-omics data. Briefly, we randomly stratified samples into the training set (80%) and the independent testing set (20%) (Figure 1A, see Methods in the Supporting Information).
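The stratified 80/20 split mentioned above can be sketched in a few lines of plain Python. This is only an illustration of the general technique, not the authors' pipeline: the function name, the seed, and the class labels in the usage note are hypothetical.

```python
import random
from collections import defaultdict

def stratified_split(labels, test_fraction=0.2, seed=0):
    """Return (train_indices, test_indices), holding out roughly
    test_fraction of the samples from each class."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)

    train, test = [], []
    for indices in by_class.values():
        rng.shuffle(indices)  # random assignment within each class
        n_test = round(len(indices) * test_fraction)
        test.extend(indices[:n_test])
        train.extend(indices[n_test:])
    return sorted(train), sorted(test)
```

For example, with hypothetical labels `["mild"] * 50 + ["severe"] * 30`, the call `stratified_split(labels)` holds out 10 mild and 6 severe samples, so the class proportions in the testing set mirror those in the full data.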


MDA for random forests: inconsistency, and a practical solution via the Sobol-MDA

arXiv.org Machine Learning

Variable importance measures are the main tools to analyze the black-box mechanism of random forests. Although the Mean Decrease Accuracy (MDA) is widely accepted as the most efficient variable importance measure for random forests, little is known about its theoretical properties. In fact, the exact MDA definition varies across the main random forest software. In this article, our objective is to rigorously analyze the behavior of the main MDA implementations. Consequently, we mathematically formalize the various implemented MDA algorithms, and then establish their limits when the sample size increases. In particular, we break down these limits in three components: the first two are related to Sobol indices, which are well-defined measures of a variable contribution to the output variance, widely used in the sensitivity analysis field, as opposed to the third term, whose value increases with dependence within input variables. Thus, we theoretically demonstrate that the MDA does not target the right quantity when inputs are dependent, a fact that has already been noticed experimentally. To address this issue, we define a new importance measure for random forests, the Sobol-MDA, which fixes the flaws of the original MDA. We prove the consistency of the Sobol-MDA and show its good empirical performance through experiments on both simulated and real data. An open source implementation in R and C++ is available online.
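As a rough illustration of the permutation idea behind MDA, the sketch below measures how much a model's mean squared error increases when one input column is shuffled. Since the abstract notes that the exact MDA definition varies across software, this should be read as a generic permutation importance on a fixed dataset, not as any particular package's MDA and not as the Sobol-MDA. The function names are hypothetical, and `model` is assumed to be any callable that maps a feature row (a list) to a prediction.

```python
import random

def mean_squared_error(model, X, y):
    return sum((model(row) - target) ** 2 for row, target in zip(X, y)) / len(y)

def permutation_importance(model, X, y, feature, n_repeats=10, seed=0):
    """Average increase in MSE after shuffling one feature column.

    X is a list of rows (lists of numbers); feature is a column index.
    """
    rng = random.Random(seed)
    baseline = mean_squared_error(model, X, y)
    increases = []
    for _ in range(n_repeats):
        column = [row[feature] for row in X]
        rng.shuffle(column)  # break the link between this feature and y
        X_permuted = [row[:feature] + [value] + row[feature + 1:]
                      for row, value in zip(X, column)]
        increases.append(mean_squared_error(model, X_permuted, y) - baseline)
    return sum(increases) / n_repeats
```

A feature the model never uses gets an importance of exactly zero under this scheme. When inputs are dependent, shuffling one column creates unrealistic combinations of feature values, which is the kind of situation in which the abstract says MDA stops targeting the right quantity.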


How to use PyCaret -- the library for lazy data scientists

#artificialintelligence

When we approach supervised machine learning problems, it can be tempting to just see how a random forest or gradient boosting model performs and stop experimenting if we are satisfied with the results. What if you could compare many different models with just one line of code? What if you could reduce each step of the data science process from feature engineering to model deployment to just a few lines of code? This is exactly where PyCaret comes into play. PyCaret is a high-level, low-code Python library that makes it easy to compare, train, evaluate, tune, and deploy machine learning models with only a few lines of code.


Tree boosting for learning probability measures

arXiv.org Machine Learning

Learning probability measures based on an i.i.d. sample is a fundamental inference task, but is challenging when the sample space is high-dimensional. Inspired by the success of tree boosting in high-dimensional classification and regression, we propose a tree boosting method for learning high-dimensional probability distributions. We formulate concepts of "addition" and "residuals" on probability distributions in terms of compositions of a new, more general notion of multivariate cumulative distribution functions (CDFs) than classical CDFs. This then gives rise to a simple boosting algorithm based on forward-stagewise (FS) fitting of an additive ensemble of measures. The output of the FS algorithm allows analytic computation of the probability density function for the fitted distribution. It also provides an exact simulator for drawing independent Monte Carlo samples from the fitted measure. Typical considerations in applying boosting -- namely choosing the number of trees, setting the appropriate level of shrinkage/regularization in the weak learner, and the evaluation of variable importance -- can be accomplished in an analogous fashion to traditional boosting in supervised learning. Numerical experiments confirm that boosting can substantially improve the fit to multivariate distributions compared to the state-of-the-art single-tree learner and is computationally efficient. We illustrate through an application to a data set from mass cytometry how the simulator can be used to investigate various aspects of the underlying distribution.


BEDS: Bagging ensemble deep segmentation for nucleus segmentation with testing stage stain augmentation

arXiv.org Artificial Intelligence

Reducing outcome variance is an essential task in deep learning based medical image analysis. Bootstrap aggregating, also known as bagging, is a canonical ensemble algorithm for aggregating weak learners to become a strong learner. Random forest was one of the most powerful machine learning algorithms before the deep learning era, and its superior performance is driven by fitting bagged decision trees (weak learners). Inspired by the random forest technique, we propose a simple bagging ensemble deep segmentation (BEDs) method to train multiple U-Nets with partial training data to segment dense nuclei on pathological images. The contributions of this study are three-fold: (1) developing a self-ensemble learning framework for nucleus segmentation; (2) aggregating testing stage augmentation with self-ensemble learning; and (3) elucidating the idea that self-ensemble and testing stage stain augmentation are complementary strategies for a superior segmentation performance. Implementation Detail: https://github.com/xingli1102/BEDs.
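Bootstrap aggregating itself is simple enough to sketch in plain Python. The example below is a generic illustration and not the BEDs method: it resamples the training data with replacement, fits one model per resample, and averages the predictions. The toy 1-nearest-neighbour regressor and all function names are assumptions made for illustration; the paper instead trains U-Nets on partial training data.

```python
import random
from statistics import mean

def bootstrap_sample(data, rng):
    """Draw len(data) items from data with replacement."""
    return [rng.choice(data) for _ in data]

def fit_bagged(data, fit, n_models=25, seed=0):
    """Fit one model per bootstrap resample; fit(sample) returns a callable."""
    rng = random.Random(seed)
    return [fit(bootstrap_sample(data, rng)) for _ in range(n_models)]

def bagged_predict(models, x):
    """Aggregate by averaging the individual predictions."""
    return mean(model(x) for model in models)

def fit_nearest_neighbour(sample):
    """Toy base learner: predict the target of the closest training point."""
    def predict(x):
        _, nearest_y = min(sample, key=lambda pair: abs(pair[0] - x))
        return nearest_y
    return predict
```

With `data` given as a list of `(x, y)` pairs, `fit_bagged(data, fit_nearest_neighbour)` returns a list of models, and `bagged_predict(models, x)` averages them. Averaging over resamples is what reduces the variance of an unstable base learner.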