Ensemble Learning
A high-bias, low-variance introduction to Machine Learning for physicists
Mehta, Pankaj, Bukov, Marin, Wang, Ching-Hao, Day, Alexandre G. R., Richardson, Clint, Fisher, Charles K., Schwab, David J.
Machine Learning (ML) is one of the most exciting and dynamic areas of modern research and application. The purpose of this review is to provide an introduction to the core concepts and tools of machine learning in a manner easily understood and intuitive to physicists. The review begins by covering fundamental concepts in ML and modern statistics such as the bias-variance tradeoff, overfitting, regularization, and generalization before moving on to more advanced topics in both supervised and unsupervised learning. Topics covered in the review include ensemble models, deep learning and neural networks, clustering and data visualization, energy-based models (including MaxEnt models and Restricted Boltzmann Machines), and variational methods. Throughout, we emphasize the many natural connections between ML and statistical physics. A notable aspect of the review is the use of Python notebooks to introduce modern ML/statistical packages to readers using physics-inspired datasets (the Ising Model and Monte-Carlo simulations of supersymmetric decays of proton-proton collisions). We conclude with an extended outlook discussing possible uses of machine learning for furthering our understanding of the physical world as well as open problems in ML where physicists may be able to contribute. (Notebooks are available at https://physics.bu.edu/~pankajm/MLnotebooks.html )
Boosting Random Forests to Reduce Bias; One-Step Boosted Forest and its Variance Estimate
Ghosal, Indrayudh, Hooker, Giles
In this paper we propose using the principle of boosting to reduce the bias of a random forest prediction in the regression setting. From the original random forest fit we extract the residuals and then fit another random forest to these residuals. We call the sum of these two random forests a one-step boosted forest. We have shown with simulated and real data that the one-step boosted forest has a reduced bias compared to the original random forest. The paper also provides a variance estimate of the one-step boosted forest by an extension of the infinitesimal Jackknife estimator. Using this variance estimate we can construct prediction intervals for the boosted forest and we show that they have good coverage probabilities. Combining the bias reduction and the variance estimate we have shown that the one-step boosted forest has a significant reduction in predictive mean squared error and thus an improvement in predictive performance. When applied on datasets from the UCI database we have empirically proven that the one-step boosted forest performs better than the random forest and gradient boosting machine algorithms. Theoretically we can also extend such a boosting process to more than one step and the same principles outlined in this paper can be used to find variance estimates for such predictors. Such boosting will reduce bias even further but it risks over-fitting and also increases the computational burden.
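The residual-fitting idea in this abstract can be illustrated with a minimal sketch. Everything below is an assumption made for illustration, not the authors' implementation: bagged one-split stumps stand in for a real random forest, the residuals are plain in-sample residuals, the data are synthetic, and the names `fit_stump`, `fit_forest`, and `boosted` are hypothetical. The variance estimate described in the paper is not attempted.

```python
import math
import random

def fit_stump(xs, ys):
    """Fit a one-split regression stump on 1-D inputs by least squares."""
    pairs = sorted(zip(xs, ys))
    mean_all = sum(ys) / len(ys)
    # Fallback (no valid split): predict the overall mean on both sides.
    best = (float("inf"), pairs[0][0], mean_all, mean_all)
    for i in range(1, len(pairs)):
        if pairs[i][0] == pairs[i - 1][0]:
            continue  # cannot split between identical x values
        left = [y for _, y in pairs[:i]]
        right = [y for _, y in pairs[i:]]
        ml, mr = sum(left) / len(left), sum(right) / len(right)
        sse = sum((y - ml) ** 2 for y in left) + sum((y - mr) ** 2 for y in right)
        if sse < best[0]:
            best = (sse, (pairs[i - 1][0] + pairs[i][0]) / 2, ml, mr)
    _, threshold, ml, mr = best
    return lambda x: ml if x <= threshold else mr

def fit_forest(xs, ys, n_trees, rng):
    """Bagged stumps: a deliberately simple stand-in for a random forest."""
    n = len(xs)
    trees = []
    for _ in range(n_trees):
        idx = [rng.randrange(n) for _ in range(n)]  # bootstrap sample
        trees.append(fit_stump([xs[i] for i in idx], [ys[i] for i in idx]))
    return lambda x: sum(t(x) for t in trees) / len(trees)

def mse(model, xs, ys):
    return sum((model(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

rng = random.Random(0)
xs = [rng.uniform(0.0, 6.0) for _ in range(120)]
ys = [math.sin(x) + rng.gauss(0.0, 0.1) for x in xs]

base = fit_forest(xs, ys, n_trees=30, rng=rng)               # original forest
residuals = [y - base(x) for x, y in zip(xs, ys)]            # what it missed
correction = fit_forest(xs, residuals, n_trees=30, rng=rng)  # forest on residuals

def boosted(x):
    """One-step boosted prediction: the sum of the two forests."""
    return base(x) + correction(x)

# Compare against the noise-free target on a grid as a rough proxy for bias.
grid = [0.05 * i for i in range(121)]
truth = [math.sin(x) for x in grid]
err_base = mse(base, grid, truth)
err_boosted = mse(boosted, grid, truth)
print(f"error vs. true function  base: {err_base:.4f}  boosted: {err_boosted:.4f}")
```

The base learner here is intentionally too weak for the target function, so the second forest has visible structure left to fit. With a real random forest the gain is typically smaller and depends on the data.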
Randomer Forests
Tomita, Tyler M., Browne, James, Shen, Cencheng, Priebe, Carey E., Burns, Randal, Maggioni, Mauro, Vogelstein, Joshua T.
Ensemble methods -- particularly those based on decision trees -- have recently demonstrated superior performance in a variety of machine learning settings. Specifically, Random Forest (RF) was found to outperform >100 other methods in several manuscripts, and gradient boosting trees have been a crucial component of several recent Kaggle competition victories. Building off these successes and recent advances in sparse learning and random matrix theory, we propose a novel ensemble tree method called "Randomer Forest" (RerF). The key intuition behind RerF is that we can use sparse linear combinations at each decision node rather than just one feature (as in RF) or all of them (as in Rotation Forests). RerF significantly outperforms other methods on a standard benchmark suite containing 105 problems with varying dimension, sample size, and number of classes. Moreover, we provide an implementation that scales as efficiently as, or more efficiently than, other available packages. Via a combination of basic principles, theory, and extensive numerical experiments, we demonstrate why, when, and how RerF achieves its performance properties.
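To make the "sparse linear combinations at each decision node" idea concrete, here is a toy sketch. It is not the RerF implementation: the sampling scheme (entries drawn from {-1, 0, +1} with a `density` parameter), the Gini criterion, and all function names are assumptions chosen for illustration. The data set has a diagonal class boundary, which a single axis-aligned split cannot separate but the combination x0 + x1 can.

```python
import random

def sample_sparse_directions(n_features, n_candidates, density, rng):
    """Random vectors with entries in {-1, 0, +1}; each entry is nonzero
    with probability `density`. All-zero vectors are rejected."""
    directions = []
    while len(directions) < n_candidates:
        w = [rng.choice((-1, 1)) if rng.random() < density else 0
             for _ in range(n_features)]
        if any(w):
            directions.append(w)
    return directions

def project(X, w):
    """Value of the linear combination w . x for every sample x."""
    return [sum(wi * xi for wi, xi in zip(w, x)) for x in X]

def gini(labels):
    """Gini impurity for binary 0/1 labels."""
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    return 2.0 * p * (1.0 - p)

def best_split_impurity(values, labels):
    """Lowest weighted Gini impurity over all thresholds on a 1-D projection."""
    pairs = sorted(zip(values, labels))
    n = len(pairs)
    best = gini(labels)  # impurity if the node is not split at all
    for i in range(1, n):
        if pairs[i][0] == pairs[i - 1][0]:
            continue  # no threshold fits between equal values
        left = [lab for _, lab in pairs[:i]]
        right = [lab for _, lab in pairs[i:]]
        score = (len(left) * gini(left) + len(right) * gini(right)) / n
        best = min(best, score)
    return best

# Toy data: class 1 above the diagonal x0 + x1 = 0, class 0 below it.
X = [(i, j) for i in range(-3, 4) for j in range(-3, 4) if i + j != 0]
y = [1 if i + j > 0 else 0 for i, j in X]

# Best single-feature split (what an axis-aligned tree node can do).
axis_best = min(best_split_impurity(project(X, w), y) for w in ([1, 0], [0, 1]))
# Split on the combination x0 + x1.
oblique = best_split_impurity(project(X, [1, 1]), y)

# A node search over randomly sampled sparse combinations.
rng = random.Random(0)
candidates = sample_sparse_directions(n_features=2, n_candidates=8, density=0.5, rng=rng)
sampled_best = min(best_split_impurity(project(X, w), y) for w in candidates)

print("axis-aligned:", round(axis_best, 3), "oblique:", oblique,
      "best sampled:", round(sampled_best, 3))
```

With only two features the vectors are hardly "sparse"; the point of the sketch is the mechanism. In higher dimensions the same sampler would produce combinations that touch only a few features each.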
Complete Guide to Parameter Tuning in Gradient Boosting (GBM) in Python
If you have been using GBM as a 'black box' until now, maybe it's time to open it and see how it actually works! This article is inspired by the approach of Owen Zhang (Chief Product Officer at DataRobot and Kaggle Rank 3), shared at NYC Data Science Academy. He delivered a two-hour talk, and I intend to condense it and present the most precious nuggets here. Boosting algorithms play a crucial role in dealing with the bias-variance trade-off. Unlike bagging algorithms, which only control for high variance in a model, boosting controls both aspects (bias and variance) and is considered to be more effective.
Finding Influential Training Samples for Gradient Boosted Decision Trees
Sharchilev, Boris, Ustinovsky, Yury, Serdyukov, Pavel, de Rijke, Maarten
We address the problem of finding influential training samples for a particular case of tree ensemble-based models, e.g., Random Forest (RF) or Gradient Boosted Decision Trees (GBDT). A natural way of formalizing this problem is studying how the model's predictions change upon leave-one-out retraining, leaving out each individual training sample. Recent work has shown that, for parametric models, this analysis can be conducted in a computationally efficient way. We propose several ways of extending this framework to non-parametric GBDT ensembles under the assumption that tree structures remain fixed. Furthermore, we introduce a general scheme of obtaining further approximations to our method that balance the trade-off between performance and computational complexity. We evaluate our approaches on various experimental setups and use-case scenarios and demonstrate both the quality of our approach to finding influential training samples in comparison to the baselines and its computational efficiency.
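The leave-one-out formalization mentioned in this abstract can be written down directly as a brute-force baseline. The sketch below is only that baseline, under loud assumptions: a one-dimensional least-squares line stands in for the tree ensemble, the data are synthetic with one deliberately corrupted label, and `loo_influence` is a hypothetical helper name. The paper's contribution is approximating this quantity for GBDT without retraining, which is not shown here.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = a + b * x; returns (a, b)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    b = sxy / sxx
    return my - b * mx, b

def predict(params, x):
    a, b = params
    return a + b * x

def loo_influence(xs, ys, x_query):
    """For each training point, the change in the prediction at x_query
    caused by that point: (prediction with it) - (prediction without it)."""
    full = predict(fit_line(xs, ys), x_query)
    influences = []
    for i in range(len(xs)):
        xs_i = xs[:i] + xs[i + 1:]  # retrain with sample i left out
        ys_i = ys[:i] + ys[i + 1:]
        influences.append(full - predict(fit_line(xs_i, ys_i), x_query))
    return influences

# Synthetic data on the line y = 2x + 1, with one corrupted label.
xs = [float(i) for i in range(10)]
ys = [2.0 * x + 1.0 for x in xs]
ys[9] = 40.0  # the clean value would be 19.0

influences = loo_influence(xs, ys, x_query=12.0)
most_influential = max(range(len(xs)), key=lambda i: abs(influences[i]))
print("most influential training index:", most_influential)
```

The cost is one retraining per training sample, which is exactly what makes the brute-force approach impractical for large ensembles and motivates cheaper approximations.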
Teaching computers to guide science: Machine learning method sees forests and trees: 'Iterative Random Forests' will deliver powerful scientific insights, researchers say
In a paper published recently in the Proceedings of the National Academy of Sciences (PNAS), the researchers describe a technique called "iterative Random Forests," which they say could have a transformative effect on any area of science or engineering with complex systems, including biology, precision medicine, materials science, environmental science, and manufacturing, to name a few. "Take a human cell, for example. There are 10^170 possible molecular interactions in a single cell. That creates considerable computing challenges in searching for relationships," said Ben Brown, head of Berkeley Lab's Molecular Ecosystems Biology Department. "Our method enables the identification of interactions of high order at the same computational cost as main effects -- even when those interactions are local with weak marginal effects."
Gradient Boosting in TensorFlow vs XGBoost
TensorFlow 1.4 was released a few weeks ago with an implementation of Gradient Boosting, called TensorFlow Boosted Trees (TFBT). Unfortunately, the paper does not have any benchmarks, so I ran some against XGBoost. For many Kaggle-style data mining problems, XGBoost has been the go-to solution since its release in 2016. It's probably as close to an out-of-the-box machine learning algorithm as you can get today, as it gracefully handles un-normalized or missing data, while being accurate and fast to train. The code to reproduce the results in this article is on GitHub.
Tuning Random Forest model Machine Learning Predictive modeling
A month back, I participated in a Kaggle competition called TFI. I started with my first submission at the 50th percentile. Having worked relentlessly on feature engineering for more than 2 weeks, I managed to reach the 20th percentile. To my surprise, right after tuning the parameters of the machine learning algorithm I was using, I was able to break into the top 10th percentile. This is how important tuning these machine learning algorithms is.
Accelerated Gradient Boosting
Biau, Gérard, Cadre, Benoît, Rouvière, Laurent
Gradient tree boosting is a prediction algorithm that sequentially produces a model in the form of linear combinations of decision trees, by solving an infinite-dimensional optimization problem. We combine gradient boosting and Nesterov's accelerated descent to design a new algorithm, which we call AGB (for Accelerated Gradient Boosting). Substantial numerical evidence is provided on both synthetic and real-life data sets to assess the excellent performance of the method in a large variety of prediction problems. It is empirically shown that AGB is much less sensitive to the shrinkage parameter and outputs predictors that are considerably more sparse in the number of trees, while retaining the exceptional performance of gradient boosting.
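A rough sketch of how gradient boosting and Nesterov-style momentum can be combined is given below. It should be read as an illustration under assumptions rather than as the AGB algorithm of the paper: the loss is squared error, one-split stumps replace regression trees, and the momentum schedule (lambda_0 = 0, lambda_{t+1} = (1 + sqrt(1 + 4 lambda_t^2)) / 2, gamma_t = (1 - lambda_t) / lambda_{t+1}) is the standard Nesterov sequence, assumed here rather than taken from the abstract. The names `momentum_schedule` and `accelerated_boost` are hypothetical.

```python
import math

def fit_stump(xs, targets):
    """Fit a one-split regression stump on 1-D inputs by least squares."""
    pairs = sorted(zip(xs, targets))
    mean_all = sum(targets) / len(targets)
    best = (float("inf"), pairs[0][0], mean_all, mean_all)
    for i in range(1, len(pairs)):
        if pairs[i][0] == pairs[i - 1][0]:
            continue
        left = [t for _, t in pairs[:i]]
        right = [t for _, t in pairs[i:]]
        ml, mr = sum(left) / len(left), sum(right) / len(right)
        sse = sum((t - ml) ** 2 for t in left) + sum((t - mr) ** 2 for t in right)
        if sse < best[0]:
            best = (sse, (pairs[i - 1][0] + pairs[i][0]) / 2, ml, mr)
    _, threshold, ml, mr = best
    return lambda x: ml if x <= threshold else mr

def momentum_schedule(n_steps):
    """Assumed Nesterov weights gamma_0, gamma_1, ... (see lead-in)."""
    gammas, lam = [], 0.0
    for _ in range(n_steps):
        lam_next = (1.0 + math.sqrt(1.0 + 4.0 * lam * lam)) / 2.0
        gammas.append((1.0 - lam) / lam_next)
        lam = lam_next
    return gammas

def accelerated_boost(xs, ys, n_steps, shrinkage):
    """Boosting with two sequences: F (the model) and G (the look-ahead point).
    Both are stored as coefficient lists over the fitted stumps."""
    f0 = sum(ys) / len(ys)          # constant initial model
    stumps, f_coef, g_coef = [], [], []
    for gamma in momentum_schedule(n_steps):
        g_vals = [f0 + sum(c * s(x) for c, s in zip(g_coef, stumps)) for x in xs]
        # Negative gradient of squared loss, evaluated at G_t (not F_t).
        residuals = [y - g for y, g in zip(ys, g_vals)]
        stumps.append(fit_stump(xs, residuals))
        f_new = g_coef + [shrinkage]   # F_{t+1} = G_t + shrinkage * stump
        f_old = f_coef + [0.0]         # F_t, padded for the new stump
        # G_{t+1} = (1 - gamma_t) * F_{t+1} + gamma_t * F_t
        g_coef = [(1.0 - gamma) * a + gamma * b for a, b in zip(f_new, f_old)]
        f_coef = f_new
    return lambda x: f0 + sum(c * s(x) for c, s in zip(f_coef, stumps))

xs = [6.0 * i / 79 for i in range(80)]
ys = [math.sin(x) for x in xs]
model = accelerated_boost(xs, ys, n_steps=30, shrinkage=0.1)

mean_y = sum(ys) / len(ys)
mse_initial = sum((y - mean_y) ** 2 for y in ys) / len(ys)
mse_final = sum((model(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)
print(f"training MSE  constant model: {mse_initial:.4f}  after boosting: {mse_final:.4f}")
```

The only structural difference from plain gradient boosting is that residuals are computed at the look-ahead sequence G rather than at the current model F. Setting every gamma to 0 makes G equal to F and recovers the usual update.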
Extreme Gradient Boosting with R
Extreme Gradient Boosting is among the hottest libraries in supervised machine learning these days. It supports various objective functions, including regression, classification, and ranking. It has gained much popularity and attention recently, as it was the algorithm of choice for many winning teams in a number of machine learning competitions. What makes it so popular is its speed and performance: it delivers some of the best results in many machine learning applications.