Boosting is a technique in machine learning that has been shown to produce models with high predictive accuracy. One of the most common ways to implement boosting in practice is to use XGBoost, short for "extreme gradient boosting." This tutorial provides a step-by-step example of how to use XGBoost to fit a boosted model in R. For this example we'll fit a boosted regression model to the Boston dataset from the MASS package. This dataset contains 13 predictor variables that we'll use to predict one response variable called mdev, which represents the median value of homes in different census tracts around Boston. We can see that the dataset contains 506 observations and 14 total variables.
We extend the theory of boosting for regression problems to the online learning setting. Generalizing from the batch setting for boosting, the notion of a weak learning algorithm is modeled as an online learning algorithm with linear loss functions that competes with a base class of regression functions, while a strong learning algorithm is an online learning algorithm with smooth convex loss functions that competes with a larger class of regression functions. Our main result is an online gradient boosting algorithm which converts a weak online learning algorithm into a strong one where the larger class of functions is the linear span of the base class. We also give a simpler boosting algorithm that converts a weak online learning algorithm into a strong one where the larger class of functions is the convex hull of the base class, and prove its optimality.
The paper is heavily based on the previous work on lattice (and monotone lattice) regression, and is thus not self-contained. It is not even explained in the main part of the paper what is the actual function which a lattice represents (two examples are found in the supplementary materials, but I still found the explanation two brief). I think this should definitely be a part of the main paper, otherwise the reader is forced to check the previous work on this topic to understand what kind model is actually being considered. While (7) is convex given calibrators being fixed, it is not clear whether it is convex when jointly optimized over calibration and lattice parameters (I doubt it is). Moreover, (3) seems highly complex and combinatorial optimization criterion.
The idea of taking into account feature costs when pruning tree ensembles is original to the best of my knowledge. The main originality of the proposed approach is the fact that it adopts a bottom-up post-pruning strategy, while most existing approaches are top-down, acting during tree growing. While the authors present this feature as an advantage of their method, actually, I'm not convinced that adopting a bottom-up strategy is a good idea for addressing this problem. Since the algorithm indeed can not modify the existing tree structure (it can only prune it), it should be less efficient in terms of feature cost reduction than top-down methods that can have a direct impact on the features selected at tree nodes. For example, let us assume that two very important features in the dataset carry on the exact same information about the output (i.e, they are redundant).
We propose to prune a random forest (RF) for resource-constrained prediction. We first construct a RF and then prune it to optimize expected feature cost & accuracy. We pose pruning RFs as a novel 0-1 integer program with linear constraints that encourages feature re-use. We establish total unimodularity of the constraint set to prove that the corresponding LP relaxation solves the original integer program. We then exploit connections to combinatorial optimization and develop an efficient primal-dual algorithm, scalable to large datasets. In contrast to our bottom-up approach, which benefits from good RF initialization, conventional methods are top-down acquiring features based on their utility value and is generally intractable, requiring heuristics. Empirically, our pruning algorithm outperforms existing state-of-the-art resource-constrained algorithms.
The paper presents two nice ways for improving the usual gradient boosting algorithm where weak classifiers are decision trees. It is a paper oriented towards efficient (less costful) implementation of the usual algorithm in order to speed up the learning of decision trees by taking into account previous computations and sparse data. The approaches are interesting and smart. A risk bound is given for one of the improvements (GOSS), which seems sound but still quite loose: according to the experiments, a tighter bound could be obtained, getting rid of the "max" sizes of considered sets. No garantee is given for the second improvement (EFB) although is seems to be quite efficient in practice.
Gradient Boosting Decision Tree (GBDT) is a popular machine learning algorithm, and has quite a few effective implementations such as XGBoost and pGBRT. Although many engineering optimizations have been adopted in these implementations, the efficiency and scalability are still unsatisfactory when the feature dimension is high and data size is large. A major reason is that for each feature, they need to scan all the data instances to estimate the information gain of all possible split points, which is very time consuming. To tackle this problem, we propose two novel techniques: Gradient-based One-Side Sampling (GOSS) and Exclusive Feature Bundling (EFB). With GOSS, we exclude a significant proportion of data instances with small gradients, and only use the rest to estimate the information gain. We prove that, since the data instances with larger gradients play a more important role in the computation of information gain, GOSS can obtain quite accurate estimation of the information gain with a much smaller data size. With EFB, we bundle mutually exclusive features (i.e., they rarely take nonzero values simultaneously), to reduce the number of features. We prove that finding the optimal bundling of exclusive features is NP-hard, but a greedy algorithm can achieve quite good approximation ratio (and thus can effectively reduce the number of features without hurting the accuracy of split point determination by much).
Thus, the paper is similar to the work of Xu et al., 2012. The main differences are the fact that the feature and evaluation costs are input-specific, the evaluation cost depends on the number of tree splits, their optimization approach is different (based on the Taylor expansion around T_{k-1}, as described in the XGBoost paper), and they use best-first growth to grow the trees to a maximum number of splits (instead of a max depth). The authors point out that their setup works either in the case where feature cost dominates or evaluation cost dominates and they show experimental results for these settings.
Many applications require learning classifiers or regressors that are both accurate and cheap to evaluate. Prediction cost can be drastically reduced if the learned predictor is constructed such that on the majority of the inputs, it uses cheap features and fast evaluations. The main challenge is to do so with little loss in accuracy. In this work we propose a budget-aware strategy based on deep boosted regression trees. In contrast to previous approaches to learning with cost penalties, our method can grow very deep trees that on average are nonetheless cheap to compute. We evaluate our method on a number of datasets and find that it outperforms the current state of the art by a large margin. Our algorithm is easy to implement and its learning time is comparable to that of the original gradient boosting.
Model interpretability is an increasingly important component of practical machine learning. Some of the most common forms of interpretability systems are example-based, local, and global explanations. One of the main challenges in interpretability is designing explanation systems that can capture aspects of each of these explanation types, in order to develop a more thorough understanding of the model. We address this challenge in a novel model called MAPLE that uses local linear modeling techniques along with a dual interpretation of random forests (both as a supervised neighborhood approach and as a feature selection method). MAPLE has two fundamental advantages over existing interpretability systems. First, while it is effective as a black-box explanation system, MAPLE itself is a highly accurate predictive model that provides faithful self explanations, and thus sidesteps the typical accuracy-interpretability trade-off. Specifically, we demonstrate, on several UCI datasets, that MAPLE is at least as accurate as random forests and that it produces more faithful local explanations than LIME, a popular interpretability system. Second, MAPLE provides both example-based and local explanations and can detect global patterns, which allows it to diagnose limitations in its local explanations.