Ensemble methods use multiple learning algorithms to obtain better predictive performance than could be obtained from any of the constituent learning algorithms alone. (Wikipedia)
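To make the definition concrete, here is a minimal sketch of the simplest way constituent learners are combined: majority voting over their predicted class labels. The classifier predictions below are made-up stand-ins for the outputs of real fitted models.

```python
from collections import Counter

def majority_vote(predictions):
    """Combine the class predictions of several models by majority vote.

    predictions: list of per-model prediction lists (one label per sample).
    Returns one combined label per sample.
    """
    n_samples = len(predictions[0])
    combined = []
    for i in range(n_samples):
        votes = Counter(model_preds[i] for model_preds in predictions)
        combined.append(votes.most_common(1)[0][0])
    return combined

# Three hypothetical classifiers' labels on four samples
preds = [
    [1, 0, 1, 1],
    [1, 1, 0, 1],
    [0, 0, 1, 1],
]
print(majority_vote(preds))  # [1, 0, 1, 1]
```

Even when each individual model makes some mistakes, the vote can be correct as long as the models' errors do not all coincide on the same samples.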
This toolkit executes the RFEX 2.0 "pipeline", i.e. a set of steps that produces the information comprising the RFEX 2.0 summary, which serves to enhance the explainability of a Random Forest classifier. It comes with a synthetically generated test database that helps demonstrate how RFEX 2.0 works. With this toolkit, users can also generate an RFEX 2.0 summary from their own data. The background of the RFEX 2.0 method, as well as a description of and access to the synthetic test database convenient for testing and demonstration, can be found in TR 18.01 at cs.sfsu.edu. Users are strongly advised to read that report before using this toolkit.
In the last few years, there have been many methodological and theoretical advances in the random forests approach. Methodological developments and extensions include case-specific random forests, multivariate random forests, quantile regression forests, random survival forests, enriched random forests for microarray data, and predictor augmentation in random forests, among others. On the theoretical side, the statistical and asymptotic properties of random forests have been intensively investigated, with advances in areas such as consistency, variable selection, and the construction of confidence intervals. Although RF methodology has proven itself a reliable predictive approach in many application areas, there are some cases where random forests may suffer. First, as a fully nonparametric predictive algorithm, random forests may not efficiently incorporate known relationships between the response and the predictors. Second, random forests may fail in extrapolation problems, where predictions are required at points outside the domain of the training dataset. For regression problems, a random forest prediction is an average of the predictions produced by the trees in the forest.
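The averaging step in the last sentence can be sketched in a few lines. The "trees" below are trivial stand-in functions rather than fitted decision trees, purely to show the mechanics:

```python
def forest_predict(trees, x):
    """Random forest regression: average the individual tree predictions."""
    return sum(tree(x) for tree in trees) / len(trees)

# Three hypothetical regression "trees" (stubs standing in for fitted trees)
trees = [
    lambda x: 2.0 if x < 5 else 8.0,
    lambda x: 3.0 if x < 4 else 7.0,
    lambda x: 2.5 if x < 6 else 9.0,
]
print(forest_predict(trees, 3.0))  # (2.0 + 3.0 + 2.5) / 3 = 2.5
```

Note that an average of tree outputs can never fall outside the range spanned by those outputs, which is one way to see the extrapolation limitation mentioned above: each tree predicts a constant per leaf, so the forest cannot produce values beyond what it saw in training.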
The Gradient Boosted Tree (GBT) algorithm is one of the most popular machine learning algorithms used in production, for tasks that include Click-Through Rate (CTR) prediction and learning-to-rank. To deal with the massive datasets available today, many distributed GBT methods have been proposed. However, they all assume a row-distributed dataset, addressing scalability only with respect to the number of data points and not the number of features, and increasing communication cost for high-dimensional data. In order to allow for scalability across both the data point and feature dimensions, and reduce communication cost, we propose block-distributed GBTs. We achieve communication efficiency by making full use of the data sparsity and adapting the Quickscorer algorithm to the block-distributed setting. We evaluate our approach using datasets with millions of features, and demonstrate that we are able to achieve multiple orders of magnitude reduction in communication cost for sparse data, with no loss in accuracy, while providing a more scalable design. As a result, we are able to reduce the training time for high-dimensional data, and allow more cost-effective scale-out without the need for expensive network communication.
Gradient Boosting Machine (GBM) is an extremely powerful supervised learning algorithm that is widely used in practice. GBM routinely features as a leading algorithm in machine learning competitions such as Kaggle and the KDDCup. In this work, we propose Accelerated Gradient Boosting Machine (AGBM) by incorporating Nesterov's acceleration techniques into the design of GBM. The difficulty in accelerating GBM lies in the fact that weak (inexact) learners are commonly used, and therefore the errors can accumulate in the momentum term. To overcome this, we design a "corrected pseudo residual" and fit the best weak learner to this corrected pseudo residual in order to perform the z-update. Thus, we are able to derive novel computational guarantees for AGBM. This is the first GBM-type algorithm with a theoretically justified accelerated convergence rate. Finally, we demonstrate with a number of numerical experiments the effectiveness of AGBM over conventional GBM in obtaining a model with good training and/or testing data fidelity.
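For readers unfamiliar with the baseline being accelerated: a minimal sketch of conventional (unaccelerated) gradient boosting for squared loss, where the pseudo-residuals are simply y - F(x) and each round fits a regression stump to them. This shows plain GBM only, not the AGBM momentum or corrected-residual machinery; all data values are made up for illustration.

```python
def fit_stump(x, r):
    """Fit the best single-split regression stump on a 1-D feature x
    against targets r, by exhaustive search over thresholds."""
    best = None
    for t in sorted(set(x)):
        left = [ri for xi, ri in zip(x, r) if xi <= t]
        right = [ri for xi, ri in zip(x, r) if xi > t]
        if not left or not right:
            continue
        lm, rm = sum(left) / len(left), sum(right) / len(right)
        sse = (sum((ri - lm) ** 2 for ri in left)
               + sum((ri - rm) ** 2 for ri in right))
        if best is None or sse < best[0]:
            best = (sse, t, lm, rm)
    _, t, lm, rm = best
    return lambda xi, t=t, lm=lm, rm=rm: lm if xi <= t else rm

def gbm(x, y, rounds=20, lr=0.3):
    """Gradient boosting for squared loss: each round fits a stump to the
    current pseudo-residuals y - F(x) and adds it with shrinkage lr."""
    f0 = sum(y) / len(y)          # initial prediction: the mean
    stumps = []

    def predict(xi):
        return f0 + sum(lr * s(xi) for s in stumps)

    for _ in range(rounds):
        residuals = [yi - predict(xi) for xi, yi in zip(x, y)]
        stumps.append(fit_stump(x, residuals))
    return predict

x = [1, 2, 3, 4, 5, 6]
y = [1.0, 1.2, 0.9, 3.8, 4.1, 4.0]
model = gbm(x, y)
```

The AGBM paper's contribution is, roughly, adding a momentum sequence to this loop while correcting the pseudo-residuals so that the weak learners' approximation errors do not accumulate in the momentum term.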
A data set is called imbalanced if it contains many more samples from one class than from the others: at least one class (the minority class) is represented by only a small number of training examples, while the other classes make up the majority. In this scenario, classifiers can have good accuracy on the majority class but very poor accuracy on the minority class(es), due to the influence of the larger majority class. A common example is credit card fraud detection, where data points labeled as fraud (class 1) are usually far fewer than legitimate transactions (class 0). There are many reasons why a dataset might be imbalanced: the category being targeted might be very rare in the population, or the data might simply be difficult to collect. Let's solve the problem of an imbalanced dataset by working on one such dataset.
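One common first remedy, sketched below, is random oversampling: duplicating minority-class samples until the classes are balanced. The tiny fraud-style dataset is fabricated for illustration, and in practice one would oversample only the training split to avoid leaking duplicates into the test set.

```python
import random

def oversample_minority(X, y, seed=0):
    """Randomly duplicate minority-class samples until every class
    has as many samples as the largest class."""
    random.seed(seed)
    by_class = {}
    for xi, yi in zip(X, y):
        by_class.setdefault(yi, []).append(xi)
    target = max(len(samples) for samples in by_class.values())
    X_out, y_out = [], []
    for label, samples in by_class.items():
        extra = [random.choice(samples) for _ in range(target - len(samples))]
        for xi in samples + extra:
            X_out.append(xi)
            y_out.append(label)
    return X_out, y_out

# 8 legitimate (class 0) vs 2 fraudulent (class 1) transactions
X = [[i] for i in range(10)]
y = [0] * 8 + [1] * 2
Xb, yb = oversample_minority(X, y)
print(yb.count(0), yb.count(1))  # 8 8
```

Other standard options include undersampling the majority class, class-weighted losses, and synthetic oversampling methods such as SMOTE.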
This paper examines split-improvement feature importance scores for tree-based methods. Starting with Classification and Regression Trees (CART; Breiman, 2017) and C4.5 (Quinlan, 2014), decision trees have been a workhorse of general machine learning, particularly within ensemble methods such as Random Forests (RF; Breiman, 2001) and Gradient Boosting Trees (Friedman, 2001). They enjoy the benefits of computational speed, few tuning parameters and natural ways of handling missing values.
Random forests (RF) are one of the most widely used ensemble learning methods in classification and regression tasks. Despite their impressive performance, their theoretical consistency, which would ensure that the result converges to the optimum as the sample size increases, has lagged far behind. Several consistent random forest variants have been proposed, yet all with relatively poor performance compared to the original random forests. In this paper, a novel RF framework named multinomial random forests (MRF) is proposed. In the MRF, an impurity-based multinomial distribution is constructed as the basis for the selection of a splitting point. This ensures that a certain degree of randomness is achieved while the overall quality of the trees is not much different from the original random forests. We prove the consistency of the MRF and demonstrate with multiple datasets that it performs similarly to the original random forests and better than existing consistent random forest variants for both classification and regression tasks.
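As a rough illustration of the general idea of impurity-weighted randomized split selection (the actual MRF construction in the paper differs in its details), the sketch below draws a split index from a softmax/multinomial distribution over candidate impurity decreases, rather than always taking the argmax as standard CART does. The impurity-decrease values and the temperature parameter are made-up assumptions for illustration.

```python
import math
import random

def multinomial_split_choice(impurity_decreases, temperature=1.0, seed=0):
    """Sample a split index from a softmax (multinomial) distribution over
    candidate splits' impurity decreases. Good splits remain likely, but
    randomness is injected compared to the deterministic argmax rule."""
    random.seed(seed)
    weights = [math.exp(d / temperature) for d in impurity_decreases]
    total = sum(weights)
    probs = [w / total for w in weights]
    idx = random.choices(range(len(probs)), weights=probs, k=1)[0]
    return idx, probs

decreases = [0.40, 0.38, 0.05]  # hypothetical impurity decreases per candidate
idx, probs = multinomial_split_choice(decreases)
```

Lowering the temperature concentrates the distribution on the best split (approaching standard CART), while raising it spreads probability mass more evenly, which is the kind of controlled randomness the consistency argument relies on.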
Let's now look at a concrete example. The problem is to predict math, reading and writing grades for high-school students in the U.S. We are given historical data that include features such as gender, race/ethnicity (which is anonymized), parent level of education, whether the student ate a standard/free/subsidized lunch, and the level of preparation for tests. Given this data, I trained a multi-class random forest model [source code]. In order to explain what the model has learned, one of the simplest techniques is to look at the relative feature importance. Feature importance measures how big an impact a given feature has on predicting the outcome.
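One model-agnostic way to compute such importances is permutation importance: shuffle one feature's column and measure how much the model's score drops. This is a generic sketch, not the impurity-based importances a random forest library would report; the toy model and two-feature data below are fabricated, with feature 0 loosely standing in for something like test preparation.

```python
import random

def permutation_importance(model, X, y, metric, seed=0):
    """Importance of feature j = drop in score after shuffling column j."""
    random.seed(seed)
    base = metric([model(row) for row in X], y)
    importances = []
    for j in range(len(X[0])):
        col = [row[j] for row in X]
        random.shuffle(col)
        X_perm = [row[:j] + [col[i]] + row[j + 1:] for i, row in enumerate(X)]
        importances.append(base - metric([model(row) for row in X_perm], y))
    return importances

def accuracy(preds, y):
    return sum(p == t for p, t in zip(preds, y)) / len(y)

# Toy model that only looks at feature 0 and ignores feature 1
model = lambda row: 1 if row[0] == 1 else 0
X = [[1, 0], [1, 1], [0, 0], [0, 1]] * 5
y = [1, 1, 0, 0] * 5
imp = permutation_importance(model, X, y, accuracy)
```

Since the toy model ignores feature 1, shuffling that column changes nothing and its importance is exactly zero, while shuffling feature 0 degrades accuracy, exactly the behavior we want an importance score to capture.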