Ensemble Learning
FairXGBoost: Fairness-aware Classification in XGBoost
Ravichandran, Srinivasan, Khurana, Drona, Venkatesh, Bharath, Edakunni, Narayanan Unny
Highly regulated domains such as finance have long favoured the use of machine learning algorithms that are scalable, transparent, robust and yield better performance. One of the most prominent examples of such an algorithm is XGBoost. Meanwhile, there is also a growing interest in building fair and unbiased models in these regulated domains and numerous bias-mitigation algorithms have been proposed to this end. However, most of these bias-mitigation methods are restricted to specific model families such as logistic regression or support vector machine models, thus leaving modelers with a difficult decision of choosing between fairness from the bias-mitigation algorithms and scalability, transparency, performance from algorithms such as XGBoost. We aim to leverage the best of both worlds by proposing a fair variant of XGBoost that enjoys all the advantages of XGBoost, while also matching the levels of fairness from the state-of-the-art bias-mitigation algorithms. Furthermore, the proposed solution requires very little in terms of changes to the original XGBoost library, thus making it easy for adoption. We provide an empirical analysis of our proposed method on standard benchmark datasets used in the fairness community.
Detecting Parkinson's Disease from Speech-task in an accessible and interpretable manner
Rahman, Wasifur, Lee, Sangwu, Islam, Md. Saiful, Mamun, Abdullah Al, Antony, Victor, Ratnu, Harshil, Ali, Mohammad Rafayet, Hoque, Ehsan
Every nine minutes a person is diagnosed with Parkinson's Disease (PD) in the United States. However, studies have shown that between 25 and 80\% of individuals with Parkinson's Disease (PD) remain undiagnosed. An online, in the wild audio recording application has the potential to help screen for the disease if risk can be accurately assessed. In this paper, we collect data from 726 unique subjects (262 PD and 464 Non-PD) uttering the "quick brown fox jumps over the lazy dog ...." to conduct automated PD assessment. We extracted both standard acoustic features and deep learning based embedding features from the speech data and trained several machine learning algorithms on them. Our models achieved 0.75 AUC by modeling the standard acoustic features through the XGBoost model. We also provide explanation behind our model's decision and show that it is focusing mostly on the widely used MFCC features and a subset of dysphonia features previously used for detecting PD from verbal phonation task.
Improved Weighted Random Forest for Classification Problems
Shahhosseini, Mohsen, Hu, Guiping
Several studies have shown that combining machine learning models in an appropriate way will introduce improvements in the individual predictions made by the base models. The key to make well-performing ensemble model is in the diversity of the base models. Of the most common solutions for introducing diversity into the decision trees are bagging and random forest. Bagging enhances the diversity by sampling with replacement and generating many training data sets, while random forest adds selecting a random number of features as well. This has made the random forest a winning candidate for many machine learning applications. However, assuming equal weights for all base decision trees does not seem reasonable as the randomization of sampling and input feature selection may lead to different levels of decision-making abilities across base decision trees. Therefore, we propose several algorithms that intend to modify the weighting strategy of regular random forest and consequently make better predictions. The designed weighting frameworks include optimal weighted random forest based on ac-curacy, optimal weighted random forest based on the area under the curve (AUC), performance-based weighted random forest, and several stacking-based weighted random forest models. The numerical results show that the proposed models are able to introduce significant improvements compared to regular random forest.
Random Forest (RF) Kernel for Regression, Classification and Survival
Feng, Dai, Baumgartner, Richard
Breiman's random forest (RF) can be interpreted as an implicit kernel generator,where the ensuing proximity matrix represents the data-driven RF kernel. Kernel perspective on the RF has been used to develop a principled framework for theoretical investigation of its statistical properties. However, practical utility of the links between kernels and the RF has not been widely explored and systematically evaluated.Focus of our work is investigation of the interplay between kernel methods and the RF. We elucidate the performance and properties of the data driven RF kernels used by regularized linear models in a comprehensive simulation study comprising of continuous, binary and survival targets. We show that for continuous and survival targets, the RF kernels are competitive to RF in higher dimensional scenarios with larger number of noisy features. For the binary target, the RF kernel and RF exhibit comparable performance. As the RF kernel asymptotically converges to the Laplace kernel, we included it in our evaluation. For most simulation setups, the RF and RFkernel outperformed the Laplace kernel. Nevertheless, in some cases the Laplace kernel was competitive, showing its potential value for applications. We also provide the results from real life data sets for the regression, classification and survival to illustrate how these insights may be leveraged in practice.Finally, we discuss further extensions of the RF kernels in the context of interpretable prototype and landmarking classification, regression and survival. We outline future line of research for kernels furnished by Bayesian counterparts of the RF.
MementoML: Performance of selected machine learning algorithm configurations on OpenML100 datasets
Kretowicz, Wojciech, Biecek, Przemysław
Finding optimal hyperparameters for the machine learning algorithm can often significantly improve its performance. But how to choose them in a time-efficient way? In this paper we present the protocol of generating benchmark data describing the performance of different ML algorithms with different hyperparameter configurations. Data collected in this way is used to study the factors influencing the algorithm's performance. This collection was prepared for the purposes of the study presented in the EPP study. We tested algorithms performance on dense grid of hyperparameters. Tested datasets and hyperparameters were chosen before any algorithm has run and were not changed. This is a different approach than the one usually used in hyperparameter tuning, where the selection of candidate hyperparameters depends on the results obtained previously. However, such selection allows for systematic analysis of performance sensitivity from individual hyperparameters. This resulted in a comprehensive dataset of such benchmarks that we would like to share. We hope, that computed and collected result may be helpful for other researchers. This paper describes the way data was collected. Here you can find benchmarks of 7 popular machine learning algorithms on 39 OpenML datasets. The detailed data forming this benchmark are available at: https://www.kaggle.com/mi2datalab/mementoml.
Random Forest Vs XGBoost Tree Based Algorithms
In machine learning, we mainly deal with two kinds of problems that are classification and regression. There are several different types of algorithms for both tasks. But we need to pick that algorithm whose performance is good on the respective data. Ensemble methods like Random Forest, Decision Tree, XGboost algorithms have shown very good results when we talk about classification. These algorithms give high accuracy at fast speed.
agtboost: Adaptive and Automatic Gradient Tree Boosting Computations
Lunde, Berent Ånund Strømnes, Kleppe, Tore Selland
Gradient tree boosting (GTB) (Friedman 2001; Mason, Baxter, Bartlett, and Frean 1999) has risen to prominence for regression problems after the introduction of xgboost (Chen and Guestrin 2016). The GTB model is an ensemble-type model, that consist of classification and regression trees (CART) (Breiman, Friedman, Stone, and Olshen 1984) that are learned in an iterative manner. GTB models are very flexible in that they automatically learn nonlinear relationships and interaction effects. However, with the increased flexibility of GTB models comes substantial worries of overfitting. The top performing gradient tree boosting libraries, such as xgboost, LightGBM (Ke, Meng, Finley, Wang, Chen, Ma, Ye, and Liu 2017) and catboost (Dorogush, Ershov, and Gulin 2018), all come with a large number of hyperparameters available for manual tuning to constrain the complexity of the GTB models. Training of gradient tree boosting models, in general, thus require some familiarity with both the chosen package, and the data for efficient tuning and application to the problem at hand. The main focus of the hyperparameters and tuning are to solve the following problems: - The complexity of trees: What are the topology of all the different trees?
How The New AI Model For Rapid COVID-19 Screening Works?
With the current pandemic spreading like wildfire, the requirement for a faster diagnosis can not be more critical than now. As a matter of fact, the traditional real-time polymerase chain reaction testing (RT-PCR) using the nose and throat swab has not only been termed to have limited sensitivity but also time-consuming for operational reasons. Thus, to expedite the process of COVID-19 diagnosis, researchers from the University of Oxford developed two early-detection AI models leveraging the routine data collected from clinical reports. In a recent paper, the Oxford researchers revealed the two AI models and highlighted its effectiveness in screening the virus in patients coming for checkups to the hospital -- for an emergency checkup or for admitting in the hospital. To validate these real-time prediction models, researchers used primary clinical data, including lab tests of the patients, their vital signs and their blood reports.
Machine learning for the selection of carbon-based materials for tetracycline and sulfamethoxazole adsorption
Antiobiotics adsorption on carbon-based materials was modeled by machine learning. Random forest showed best prediction accuracy than GBT and ANN. Impact tendencies of SBET, pHsol, C0 on adsorption were similar for TC and SMX. Chemical compositions and pHpzc of CBMs showed different influences on TC and SMX. Antibiotics as emerging pollutants have attracted extensive attention due to their ecotoxicity and persistence in the environment.
Feature Selection Methods for Cost-Constrained Classification in Random Forests
Jagdhuber, Rudolf, Lang, Michel, Rahnenführer, Jörg
Cost-sensitive feature selection describes a feature selection problem, where features raise individual costs for inclusion in a model. These costs allow to incorporate disfavored aspects of features, e.g. failure rates of as measuring device, or patient harm, in the model selection process. Random Forests define a particularly challenging problem for feature selection, as features are generally entangled in an ensemble of multiple trees, which makes a post hoc removal of features infeasible. Feature selection methods therefore often either focus on simple pre-filtering methods, or require many Random Forest evaluations along their optimization path, which drastically increases the computational complexity. To solve both issues, we propose Shallow Tree Selection, a novel fast and multivariate feature selection method that selects features from small tree structures. Additionally, we also adapt three standard feature selection algorithms for cost-sensitive learning by introducing a hyperparameter-controlled benefit-cost ratio criterion (BCR) for each method. In an extensive simulation study, we assess this criterion, and compare the proposed methods to multiple performance-based baseline alternatives on four artificial data settings and seven real-world data settings. We show that all methods using a hyperparameterized BCR criterion outperform the baseline alternatives. In a direct comparison between the proposed methods, each method indicates strengths in certain settings, but no one-fits-all solution exists. On a global average, we could identify preferable choices among our BCR based methods. Nevertheless, we conclude that a practical analysis should never rely on a single method only, but always compare different approaches to obtain the best results.