 Ensemble Learning


A Comparison of Resampling and Recursive Partitioning Methods in Random Forest for Estimating the Asymptotic Variance Using the Infinitesimal Jackknife

arXiv.org Machine Learning

The infinitesimal jackknife (IJ) has recently been applied to the random forest to estimate its prediction variance. These theorems were verified under a traditional random forest framework which uses classification and regression trees (CART) and bootstrap resampling. However, random forests using conditional inference (CI) trees and subsampling have been found not to be prone to variable selection bias. Here, we conduct simulation experiments using a novel approach to explore the applicability of the IJ to random forests using variations on the resampling method and base learner. Test data points were simulated, and predictions for each were obtained from random forests trained on one hundred simulated training data sets, using different combinations of resampling methods and base learners. Using CI trees instead of traditional CART trees as well as using subsampling instead of bootstrap sampling resulted in a much more accurate estimation of prediction variance when using the IJ. The random forest variations here have been incorporated into an open source software package for the R programming language.
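As a rough illustration of the estimator discussed in this abstract, the sketch below computes the IJ variance in the form commonly written for bagged predictors: the sum, over training points, of the squared covariance between each point's resampling count and the base learners' predictions. It is not the paper's code. The function names are hypothetical, the "trees" are replaced by a toy learner that predicts the mean of its bootstrap sample, and only bootstrap resampling is covered (subsampling is not).

```python
import random
from statistics import mean


def ij_variance(counts, preds):
    """Infinitesimal jackknife variance estimate for a bagged prediction.

    counts[b][i] is how many times training point i appears in resample b;
    preds[b] is the prediction of the b-th base learner at the test point.
    """
    n_resamples = len(preds)
    n_points = len(counts[0])
    pred_mean = mean(preds)
    total = 0.0
    for i in range(n_points):
        count_mean = mean(row[i] for row in counts)
        # Empirical covariance between the inclusion count of point i
        # and the base learner's prediction, taken across resamples.
        cov = sum(
            (counts[b][i] - count_mean) * (preds[b] - pred_mean)
            for b in range(n_resamples)
        ) / n_resamples
        total += cov ** 2
    return total


def bagged_mean_demo(y, n_resamples=2000, seed=0):
    """Toy ensemble (an assumption): each learner predicts its sample mean."""
    rng = random.Random(seed)
    n = len(y)
    counts, preds = [], []
    for _ in range(n_resamples):
        idx = [rng.randrange(n) for _ in range(n)]
        row = [0] * n
        for j in idx:
            row[j] += 1
        counts.append(row)
        preds.append(mean(y[j] for j in idx))
    return mean(preds), ij_variance(counts, preds)


y = [2.0, 4.0, 4.0, 5.0, 7.0, 9.0]
prediction, variance = bagged_mean_demo(y)
print(prediction, variance)
```

For this toy learner the estimate should land near the sum of squared deviations of `y` from its mean divided by n squared, up to Monte Carlo noise from the finite number of resamples.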


Extreme Gradient Boosting with XGBoost

@machinelearnbot

Do you know the basics of supervised learning and want to learn to use state-of-the-art models on real-world datasets? Gradient boosting is currently one of the most popular techniques for efficient modeling of tabular datasets of all sizes. XGBoost is a very fast, scalable implementation of gradient boosting that has taken data science by storm, with models using XGBoost regularly winning online data science competitions and being used at scale across different industries. In this course, you'll learn how to use this powerful library alongside pandas and scikit-learn to build and tune supervised learning models. You'll work with real-world datasets to solve classification as well as regression problems.


Optimizing Prediction Intervals by Tuning Random Forest via Meta-Validation

arXiv.org Machine Learning

Recent studies have shown that tuning prediction models increases prediction accuracy and that Random Forest can be used to construct prediction intervals. However, to the best of our knowledge, no study has investigated the need to, and the manner in which one can, tune Random Forest for optimizing prediction intervals; this paper aims to fill this gap. We explore a tuning approach that combines an effectively exhaustive search with a validation technique on a single Random Forest parameter. This paper investigates which, out of eight validation techniques, are beneficial for tuning, i.e., which automatically choose a Random Forest configuration constructing prediction intervals that are reliable and with a smaller width than the default configuration. Additionally, we present and validate three meta-validation techniques to determine which are beneficial, i.e., those which automatically choose a beneficial validation technique. This study uses data from our industrial partner (Keymind Inc.) and the Tukutuku Research Project, related to post-release defect prediction and Web application effort estimation, respectively. Results from our study indicate that: i) the default configuration is frequently unreliable, ii) most of the validation techniques, including previously successfully adopted ones such as 50/50 holdout and bootstrap, are counterproductive in most of the cases, and iii) the 75/25 holdout meta-validation technique is always beneficial; i.e., it avoids the likely counterproductive effects of validation techniques.


Gradient Boosting in TensorFlow vs XGBoost

@machinelearnbot

TensorFlow 1.4 was released a few weeks ago with an implementation of Gradient Boosting, called TensorFlow Boosted Trees (TFBT). Unfortunately, the paper does not have any benchmarks, so I ran some against XGBoost. For many Kaggle-style data mining problems, XGBoost has been the go-to solution since its release in 2014. It's probably as close to an out-of-the-box machine learning algorithm as you can get today, as it gracefully handles un-normalized or missing data, while being accurate and fast to train. The code to reproduce the results in this article is on GitHub.


How to Install XGBoost for Python on macOS - Machine Learning Mastery

#artificialintelligence

XGBoost is a library for developing very fast and accurate gradient boosting models. It is a library at the center of many winning solutions in Kaggle data science competitions. In this tutorial, you will discover how to install the XGBoost library for Python on macOS. Note: I have used this procedure for years on a range of different macOS versions and it has not changed.


Learn Gradient Boosting Algorithm for better predictions (with codes in R)

@machinelearnbot

The accuracy of a predictive model can be boosted in two ways: either by embracing feature engineering or by applying boosting algorithms straight away. Having participated in lots of data science competitions, I've noticed that people prefer to work with boosting algorithms, as this takes less time and produces similar results. There are multiple boosting algorithms, such as Gradient Boosting, XGBoost, AdaBoost, and Gentle Boost. Every algorithm has its own underlying mathematics, and a slight variation is observed while applying them. If you are new to this, great! You shall be learning all these concepts in a week's time from now.
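To make the idea of gradient boosting concrete, here is a minimal sketch in plain Python, not taken from the article (which uses R). It assumes squared-error loss, for which the negative gradient is simply the residual, a single numeric feature with at least two distinct values, and one-split "stumps" as weak learners. All function names are illustrative.

```python
def fit_stump(xs, residuals):
    """Find the threshold split on a 1-D feature that minimises squared error."""
    best = None
    # Dropping the largest value keeps the right-hand side non-empty.
    for threshold in sorted(set(xs))[:-1]:
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_value = sum(left) / len(left)
        right_value = sum(right) / len(right)
        error = sum((r - left_value) ** 2 for r in left) + sum(
            (r - right_value) ** 2 for r in right
        )
        if best is None or error < best[0]:
            best = (error, threshold, left_value, right_value)
    return best[1:]


def predict_stump(stump, x):
    threshold, left_value, right_value = stump
    return left_value if x <= threshold else right_value


def fit_gradient_boosting(xs, ys, n_rounds=50, learning_rate=0.1):
    """Boost stumps on squared loss, whose negative gradient is the residual."""
    base = sum(ys) / len(ys)  # start from the mean prediction
    stumps = []
    current = [base] * len(xs)
    for _ in range(n_rounds):
        residuals = [y - p for y, p in zip(ys, current)]
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        # Take a small step in the direction suggested by the new stump.
        current = [
            p + learning_rate * predict_stump(stump, x)
            for p, x in zip(current, xs)
        ]
    return base, stumps, learning_rate


def predict(model, x):
    base, stumps, learning_rate = model
    return base + learning_rate * sum(predict_stump(s, x) for s in stumps)


def mse(model, xs, ys):
    return sum((predict(model, x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


xs = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0]
ys = [1.0, 1.2, 0.9, 3.1, 3.0, 2.9, 5.2, 5.0]
baseline = fit_gradient_boosting(xs, ys, n_rounds=0)
boosted = fit_gradient_boosting(xs, ys, n_rounds=50)
print(mse(baseline, xs, ys), mse(boosted, xs, ys))
```

Each round fits a weak learner to what the current ensemble still gets wrong, so the training error of the boosted model falls well below that of the constant baseline. Libraries such as XGBoost add regularisation, second-order information, and many engineering optimisations on top of this basic loop.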


Gradient Boosting in TensorFlow vs XGBoost - Nicolò Valigi

@machinelearnbot

TensorFlow 1.4 was released a few weeks ago with an implementation of Gradient Boosting, called TensorFlow Boosted Trees (TFBT). Unfortunately, the paper does not have any benchmarks, so I ran some against XGBoost. For many Kaggle-style data mining problems, XGBoost has been the go-to solution since its release in 2014. It's probably as close to an out-of-the-box machine learning algorithm as you can get today, as it gracefully handles un-normalized or missing data, while being accurate and fast to train. The code to reproduce the results in this article is on GitHub.


slundberg/shap

#artificialintelligence

SHAP (SHapley Additive exPlanations) explains the output of any machine learning model using expectations and Shapley values. SHAP unifies aspects of several previous methods [1-7] and represents the only possible consistent and locally accurate additive feature attribution method based on expectations (see SHAP paper for details). While SHAP values can explain the output of any machine learning model, we have developed a high-speed exact algorithm for ensemble tree methods (Tree SHAP paper). The above explanation shows features each contributing to push the model output from the base value (the average model output over the training dataset we passed) to the model output. Features pushing the prediction higher are shown in red, those pushing the prediction lower are in blue.
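The additive, locally accurate attribution described here can be illustrated with a brute-force Shapley computation. The sketch below does not use the shap library and is not its algorithm: it enumerates every feature subset, which is only feasible for a handful of features. The model, the background rows, and the choice to fill in "missing" features from background rows (treating features as independent) are all assumptions made for illustration.

```python
from itertools import combinations
from math import factorial


def shapley_values(value, n_features):
    """Exact Shapley values for a set function `value` over feature indices."""
    features = range(n_features)
    phi = [0.0] * n_features
    for i in features:
        others = [j for j in features if j != i]
        for size in range(len(others) + 1):
            # Shapley weight for coalitions of this size.
            weight = (
                factorial(size) * factorial(n_features - size - 1)
                / factorial(n_features)
            )
            for subset in combinations(others, size):
                s = frozenset(subset)
                phi[i] += weight * (value(s | {i}) - value(s))
    return phi


def model(x):
    # Hypothetical model with an interaction between features 1 and 2.
    return 2.0 * x[0] + x[1] * x[2]


# Hypothetical background data standing in for a training set.
background = [(0.0, 0.0, 0.0), (1.0, 2.0, 0.0), (2.0, 1.0, 3.0), (1.0, 1.0, 1.0)]
x = (3.0, 2.0, 2.0)


def value(subset):
    """Expected model output with features in `subset` fixed to x
    and the remaining features taken from each background row."""
    outputs = []
    for row in background:
        mixed = tuple(x[j] if j in subset else row[j] for j in range(len(x)))
        outputs.append(model(mixed))
    return sum(outputs) / len(outputs)


phi = shapley_values(value, len(x))
base_value = value(frozenset())
print(base_value, phi, base_value + sum(phi), model(x))
```

The base value is the average model output over the background rows, and the attributions sum to the gap between that base value and the model output for `x`, which is the local accuracy property mentioned above.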


Cosmic String Detection with Tree-Based Machine Learning

arXiv.org Machine Learning

We explore the use of random forest and gradient boosting, two powerful tree-based machine learning algorithms, for the detection of cosmic strings in maps of the cosmic microwave background (CMB), through their unique Gott-Kaiser-Stebbins effect on the temperature anisotropies. The information in the maps is compressed into feature vectors before being passed to the learning units. The feature vectors contain various statistical measures of processed CMB maps that boost the cosmic string detectability. Our proposed classifiers, after training, give results improved over or similar to the claimed detectability levels of the existing methods for string tension, $G\mu$. They can make $3\sigma$ detection of strings with $G\mu \gtrsim 2.1\times 10^{-10}$ for noise-free, $0.9'$-resolution CMB observations. The minimum detectable tension increases to $G\mu \gtrsim 3.0\times 10^{-8}$ for a more realistic, CMB S4-like (II) strategy, still a significant improvement over the previous results.


Hyperparameter Tuning the Random Forest in Python - Towards Data Science

#artificialintelligence

I have included Python code in this article where it is most instructive. Full code and data to follow along can be found on the project GitHub page. The best way to think about hyperparameters is as the settings of an algorithm that can be adjusted to optimize performance, just as we might turn the knobs of an AM radio to get a clear signal (or your parents might have!). While model parameters are learned during training -- such as the slope and intercept in a linear regression -- hyperparameters must be set by the data scientist before training. In the case of a random forest, hyperparameters include the number of decision trees in the forest and the number of features considered by each tree when splitting a node.
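As a small sketch of what setting these two hyperparameters might look like, the example below assumes scikit-learn, where they are exposed as `n_estimators` and `max_features`. The synthetic dataset and the tiny grid of candidate values are arbitrary choices for illustration, not the article's own code or data.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

# Synthetic regression data standing in for a real dataset (an assumption).
X, y = make_regression(n_samples=200, n_features=8, noise=10.0, random_state=0)

# Hyperparameters are chosen before training; the grid lists candidates.
param_grid = {
    "n_estimators": [10, 50],    # number of trees in the forest
    "max_features": [0.5, 1.0],  # fraction of features tried at each split
}

# Cross-validated search over every combination in the grid.
search = GridSearchCV(
    RandomForestRegressor(random_state=0),
    param_grid,
    cv=3,
)
search.fit(X, y)
print(search.best_params_)
```

The split thresholds inside each tree are learned from the data during `fit`, whereas the values in `param_grid` are fixed for each candidate model and only compared through cross-validation.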