Ensemble Learning
Pair-Wise Hyperparameter Tuning with the Native XGBoost API
Our objective here is to perform hyperparameter tuning of the native XGBoost API in order to improve its regression performance while addressing bias-variance trade-off -- especially to alleviate Boosting Machine's tendency of overfitting. In order to conduct hyperparameter tuning, this analysis uses the grid search method. In other words, we select the search grid of hyperparameters and calculate the model performance over all the hyperparameter datapoints on the search-grid. Then, we identify the global local minimum of the performance -- or the hyperparameter datapoint which yields the best performance (the minimum value of the Objective Function) -- as the best hyperparameter values for the tuned model. Hyperparameter tuning can be computationally very expensive depending on how you set the search grid.
Prediction of drug effectiveness in rheumatoid arthritis patients based on machine learning algorithms
Chen, Shengjia, Gupta, Nikunj, Galbraith, Woodward B., Shah, Valay, Cirrone, Jacopo
Rheumatoid arthritis (RA) is an autoimmune condition caused when patients' immune system mistakenly targets their own tissue. Machine learning (ML) has the potential to identify patterns in patient electronic health records (EHR) to forecast the best clinical treatment to improve patient outcomes. This study introduced a Drug Response Prediction (DRP) framework with two main goals: 1) design a data processing pipeline to extract information from tabular clinical data, and then preprocess it for functional use, and 2) predict RA patient's responses to drugs and evaluate classification models' performance. We propose a novel two-stage ML framework based on European Alliance of Associations for Rheumatology (EULAR) criteria cutoffs to model drug effectiveness. Our model Stacked-Ensemble DRP was developed and cross-validated using data from 425 RA patients. The evaluation used a subset of 124 patients (30%) from the same data source. In the evaluation of the test set, two-stage DRP leads to improved classification accuracy over other end-to-end classification models for binary classification. Our proposed method provides a complete pipeline to predict disease activity scores and identify the group that does not respond well to anti-TNF treatments, thus showing promise in supporting clinical decisions based on EHR information.
Improving Data Quality with Training Dynamics of Gradient Boosting Decision Trees
Ponti, Moacir Antonelli, Oliveira, Lucas de Angelis, Romรกn, Juan Martรญn, Argerich, Luis
Real world datasets contain incorrectly labeled instances that hamper the performance of the model and, in particular, the ability to generalize out of distribution. Also, each example might have different contribution towards learning. This motivates studies to better understanding of the role of data instances with respect to their contribution in good metrics in models. In this paper we propose a method based on metrics computed from training dynamics of Gradient Boosting Decision Trees (GBDTs) to assess the behavior of each training example. We focus on datasets containing mostly tabular or structured data, for which the use of Decision Trees ensembles are still the state-of-the-art in terms of performance. We show results on detecting noisy labels in order to either remove them, improving models' metrics in synthetic and real datasets, as well as a productive dataset. Our methods achieved the best results overall when compared with confident learning and heuristics.
Uncertainty in Extreme Multi-label Classification
Jiang, Jyun-Yu, Chang, Wei-Cheng, Zhong, Jiong, Hsieh, Cho-Jui, Yu, Hsiang-Fu
Extreme multi-label classification (XMC), or extreme multi-label learning, aims to find the relevant labels for a data input from an enormous label space. With increasingly growing information in the era of big data, XMC has become more and more important, and has been widely applied to various real-world applications, such as advertising [37], product search [9], and document retrieval [6]. However, for domains with potential high risks from mistakes like public health and medicine, it is crucial to model the predictive uncertainty for their downstream XMC applications like food classification [54] and medical diagnosis [2]. In particular, an input sometimes could have only few or even no matches in the label space, so the outputs could be noisy without uncertainty quantification. It is also insufficient to only model uncertainty for the entire input since XMC models could have different confidence for each label among the whole enormous space. To estimate predictive uncertainty, Bayesian and probabilistic models [20] are inherently applicable because variance can intrinsically be viewed as an uncertainty measurement. However, although Bayesian approaches are mathematically grounded to model uncertainty, their computational costs are usually exorbitant for large-scale data. To address this issue, the most popular solution is to approximate Bayesian inference by sampling models as an ensemble [17].
Towards Domain-Independent Supervised Discourse Parsing Through Gradient Boosting
Huber, Patrick, Carenini, Giuseppe
Discourse analysis and discourse parsing have shown great impact on many important problems in the field of Natural Language Processing (NLP). Given the direct impact of discourse annotations on model performance and interpretability, robustly extracting discourse structures from arbitrary documents is a key task to further improve computational models in NLP. To this end, we present a new, supervised paradigm directly tackling the domain adaptation issue in discourse parsing. Specifically, we introduce the first fully supervised discourse parser designed to alleviate the domain dependency through a staged model of weak classifiers by introducing the gradient boosting framework.
Creating an Ensemble Voting Classifier with Scikit-Learn
Classification ensemble models are those composed by many models fitted to the same data, where the result for the classification can be the majority's vote, an average of the results, or the best performing model result. In Figure 1, there is an example of the voting classifier that we are going to build in this quick tutorial. Observe that there are three models fitted to the data. Two of them classified the data as 1, while one classified as 0. So, by the majority's vote, class 1 wins, and that is the result. In Scikit-Learn, a commonly used example of ensemble model is the Random Forest classifier.
Multi-Target XGBoostLSS Regression
Current implementations of Gradient Boosting Machines are mostly designed for single-target regression tasks and commonly assume independence between responses when used in multivariate settings. As such, these models are not well suited if non-negligible dependencies exist between targets. To overcome this limitation, we present an extension of XGBoostLSS that models multiple targets and their dependencies in a probabilistic regression setting. Empirical results show that our approach outperforms existing GBMs with respect to runtime and compares well in terms of accuracy.
Navigating Ensemble Configurations for Algorithmic Fairness
Feffer, Michael, Hirzel, Martin, Hoffman, Samuel C., Kate, Kiran, Ram, Parikshit, Shinnar, Avraham
Bias mitigators can improve algorithmic fairness in machine learning models, but their effect on fairness is often not stable across data splits. A popular approach to train more stable models is ensemble learning, but unfortunately, it is unclear how to combine ensembles with mitigators to best navigate trade-offs between fairness and predictive performance. To that end, we built an open-source library enabling the modular composition of 8 mitigators, 4 ensembles, and their corresponding hyperparameters, and we empirically explored the space of configurations on 13 datasets. We distilled our insights from this exploration in the form of a guidance diagram for practitioners that we demonstrate is robust and reproducible.
Instance-Based Uncertainty Estimation for Gradient-Boosted Regression Trees
Brophy, Jonathan, Lowd, Daniel
Gradient-boosted regression trees (GBRTs) are hugely popular for solving tabular regression problems, but provide no estimate of uncertainty. We propose Instance-Based Uncertainty estimation for Gradient-boosted regression trees (IBUG), a simple method for extending any GBRT point predictor to produce probabilistic predictions. IBUG computes a non-parametric distribution around a prediction using the $k$-nearest training instances, where distance is measured with a tree-ensemble kernel. The runtime of IBUG depends on the number of training examples at each leaf in the ensemble, and can be improved by sampling trees or training instances. Empirically, we find that IBUG achieves similar or better performance than the previous state-of-the-art across 22 benchmark regression datasets. We also find that IBUG can achieve improved probabilistic performance by using different base GBRT models, and can more flexibly model the posterior distribution of a prediction than competing methods. We also find that previous methods suffer from poor probabilistic calibration on some datasets, which can be mitigated using a scalar factor tuned on the validation data. Source code is available at https://www.github.com/jjbrophy47/ibug.
Out of Bag (OOB) Evaluation in Random Forests
Out of Bag (OOB) Evaluation is a very important yet underrated topic in ensemble learning. People tend to learn a lot about Random forests and other bagging algorithms, but often they tend to skip or overlook this concept. I myself missed it while learning about ensemble models and failed an interview where the last question asked was "How are the Out of Bag data utilized while training a random forest model?" (hence, decided to write this blog as a lesson) Oops! Cannot recall random forests? Basically, it is nothing but absolute supervised learning based on the concept of creating independent base learners (multiple decision trees containing bootstrapped samples from the original dataset) and training them. The bootstrapped samples are created by random sampling with replacement of dataset(d), with n features, where each sample d is less than d, and n n.