 Ensemble Learning


A (small) introduction to Boosting

#artificialintelligence

Boosting is a machine learning meta-algorithm that aims to iteratively build an ensemble of weak learners, in an attempt to generate a strong overall model. For example, consider a problem of binary classification with approximately 50% of samples belonging to each class. Random guessing in this case would yield an accuracy of around 50%. So a weak learner would be any algorithm, however simple, that slightly improves this score – say 51-55% or more. Usually, weak learners are pretty basic in nature.
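The iterative idea can be sketched with AdaBoost, one well-known boosting algorithm, using one-dimensional decision stumps as the weak learners. This is a minimal illustration rather than the method of the article above: the choice of AdaBoost, the toy data, and the function names (`fit_stump`, `adaboost`, `predict`, `accuracy`) are all assumptions made for the example.

```python
import math

def stump_predict(threshold, polarity, x):
    """Predict `polarity` when x >= threshold, otherwise the opposite label."""
    return polarity if x >= threshold else -polarity

def fit_stump(xs, ys, weights):
    """Return (weighted_error, threshold, polarity) of the best 1-D stump."""
    best = None
    for threshold in sorted(set(xs)):
        for polarity in (1, -1):
            error = sum(
                w for x, y, w in zip(xs, ys, weights)
                if stump_predict(threshold, polarity, x) != y
            )
            if best is None or error < best[0]:
                best = (error, threshold, polarity)
    return best

def adaboost(xs, ys, rounds):
    """Build a weighted ensemble of stumps; labels must be +1 or -1."""
    n = len(xs)
    weights = [1.0 / n] * n  # start with uniform sample weights
    ensemble = []
    for _ in range(rounds):
        error, threshold, polarity = fit_stump(xs, ys, weights)
        error = min(max(error, 1e-10), 1 - 1e-10)  # avoid log(0) and division by zero
        alpha = 0.5 * math.log((1 - error) / error)  # this stump's vote strength
        ensemble.append((alpha, threshold, polarity))
        # Raise the weight of misclassified samples, lower it for correct ones.
        weights = [
            w * math.exp(-alpha * y * stump_predict(threshold, polarity, x))
            for x, y, w in zip(xs, ys, weights)
        ]
        total = sum(weights)
        weights = [w / total for w in weights]
    return ensemble

def predict(ensemble, x):
    """Sign of the alpha-weighted vote of all stumps."""
    score = sum(a * stump_predict(t, p, x) for a, t, p in ensemble)
    return 1 if score >= 0 else -1

def accuracy(ensemble, xs, ys):
    return sum(predict(ensemble, x) == y for x, y in zip(xs, ys)) / len(xs)

# Toy data (hypothetical): no single threshold separates the two classes.
boost_xs = list(range(10))
boost_ys = [1, 1, 1, -1, -1, -1, 1, 1, 1, -1]
print(accuracy(adaboost(boost_xs, boost_ys, 1), boost_xs, boost_ys))
print(accuracy(adaboost(boost_xs, boost_ys, 3), boost_xs, boost_ys))
```

On this toy set a single stump classifies 7 of the 10 points correctly, while three boosted stumps classify all 10, because each round reweights the samples the previous stumps got wrong.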


Tuning Parameters for Boosting/Bagging/Random Forest • /r/MachineLearning

@machinelearnbot

Random forests usually perform quite well with the default settings: a bootstrap resampling scheme, unpruned trees, as many trees as you can fit while still getting results in a reasonable amount of time, and sqrt(#features) features tried per split (the mtry parameter). You can then try to optimize these choices by checking the results on out-of-bag data (the samples each tree didn't train on because of the resampling scheme). If you have very unbalanced classes, you should decide on a measure of interest (such as the true positive rate) and tune the related parameter. Out-of-bag estimates can be trusted almost as much as a proper cross-validation if you use enough trees and bootstrap resampling.
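The out-of-bag idea can be sketched in a few lines of plain Python. This is an illustration under stated assumptions, not a random forest implementation: the base learner here is a simple threshold between class means (a stand-in for an unpruned tree, and it assumes class 1 has the larger feature values), and the helper names (`bootstrap_indices`, `mtry`, `oob_error`) and the toy data are hypothetical.

```python
import math
import random
from collections import Counter

def bootstrap_indices(n, rng):
    """Draw n sample indices with replacement."""
    return [rng.randrange(n) for _ in range(n)]

def mtry(n_features):
    """Common default: about sqrt(#features) candidate features per split."""
    return max(1, math.isqrt(n_features))

def fit_threshold(xs, ys):
    """Stand-in base learner: threshold halfway between the two class means."""
    zeros = [x for x, y in zip(xs, ys) if y == 0]
    ones = [x for x, y in zip(xs, ys) if y == 1]
    if not zeros:
        return -math.inf  # only class 1 in the bag: always predict 1
    if not ones:
        return math.inf   # only class 0 in the bag: always predict 0
    return (sum(zeros) / len(zeros) + sum(ones) / len(ones)) / 2

def predict_threshold(threshold, x):
    return 1 if x > threshold else 0

def oob_error(xs, ys, n_trees, rng):
    """Error rate where each sample is scored only by models that never saw it."""
    n = len(xs)
    votes = [[] for _ in range(n)]
    for _ in range(n_trees):
        in_bag = bootstrap_indices(n, rng)
        model = fit_threshold([xs[i] for i in in_bag], [ys[i] for i in in_bag])
        for i in set(range(n)) - set(in_bag):  # the out-of-bag samples
            votes[i].append(predict_threshold(model, xs[i]))
    scored = [(v, y) for v, y in zip(votes, ys) if v]
    if not scored:
        raise ValueError("no sample was ever out of bag; use more trees")
    wrong = sum(1 for v, y in scored if Counter(v).most_common(1)[0][0] != y)
    return wrong / len(scored)

# Toy data (hypothetical): two well-separated classes on one feature.
toy_x = list(range(10)) + list(range(20, 30))
toy_y = [0] * 10 + [1] * 10
toy_oob = oob_error(toy_x, toy_y, n_trees=25, rng=random.Random(0))
print(mtry(16), toy_oob)
```

Each bootstrap sample leaves out roughly a third of the data (about 36.8% for large n), so the ensemble gets a held-out estimate without a separate validation split.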


Accurate Sales Forecast for Data Analysts: Building a Random Forest model with Just SQL and Hivemall (Treasure Data Blog)

#artificialintelligence

In this blog post, we will use Hivemall, the open source Machine Learning-on-SQL library available in the Treasure Data environment, to introduce the basics of machine learning. We will use an E-Commerce dataset from Kaggle, the data science competition platform. The first challenge is predicting the retail sales for the Rossmann stores (full details are available on Kaggle). We will use an ensemble learning technique known as Random Forest regression. Rossmann is a pharmacy chain with over 3,000 stores in seven countries within Europe.


A Complete Tutorial on Tree Based Modeling from Scratch (in R & Python)

#artificialintelligence

Tree based learning algorithms are considered to be among the best and most widely used supervised learning methods. Tree based methods give predictive models high accuracy, stability and ease of interpretation. Unlike linear models, they map non-linear relationships quite well. They can be adapted to either kind of supervised problem (classification or regression). Methods like decision trees, random forest and gradient boosting are widely used in all kinds of data science problems. Hence, for every analyst (beginners included), it's important to learn these algorithms and use them for modeling. This tutorial is meant to help beginners learn tree based modeling from scratch. After successfully completing this tutorial, one is expected to become proficient at using tree based algorithms and building predictive models. Note: This tutorial requires no prior knowledge of machine learning.
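As a rough illustration of how a classification tree chooses a split, the sketch below scores candidate thresholds on a single numeric feature by Gini impurity, one common splitting criterion. It is not taken from the tutorial itself: the function names (`gini`, `best_split`) and the toy data are assumptions made for the example.

```python
from collections import Counter

def gini(labels):
    """Gini impurity: 0.0 for a pure node, 0.5 for a 50/50 binary node."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

def best_split(xs, ys):
    """Return (threshold, weighted_impurity) of the best split `x <= threshold`.

    Candidate thresholds are midpoints between consecutive distinct values.
    Returns None when the feature has a single distinct value.
    """
    values = sorted(set(xs))
    best = None
    for low, high in zip(values, values[1:]):
        threshold = (low + high) / 2
        left = [y for x, y in zip(xs, ys) if x <= threshold]
        right = [y for x, y in zip(xs, ys) if x > threshold]
        # Impurity of the two children, weighted by how many samples each holds.
        impurity = (len(left) * gini(left) + len(right) * gini(right)) / len(ys)
        if best is None or impurity < best[1]:
            best = (threshold, impurity)
    return best

# Toy data (hypothetical): the classes separate cleanly between 3 and 10.
split_x = [1, 2, 3, 10, 11, 12]
split_y = [0, 0, 0, 1, 1, 1]
print(gini(split_y))                # 0.5
print(best_split(split_x, split_y)) # (6.5, 0.0)
```

A full tree applies the same search recursively to each child node, which is how it can represent non-linear relationships that a single linear boundary cannot.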


Intro to Machine Learning in H2O

#artificialintelligence

The focus of this workshop is machine learning using the H2O R and Python packages. H2O is an open source distributed machine learning platform designed for big data, with the added benefit that it's easy to use on a laptop (in addition to a multi-node Hadoop or Spark cluster). The core machine learning algorithms of H2O are implemented in high-performance Java; however, fully featured APIs are available in R, Python, Scala, REST/JSON and also through a web interface. Since H2O's algorithm implementations are distributed, this allows the software to scale to very large datasets that may not fit into RAM on a single machine. H2O currently features distributed implementations of generalized linear models, gradient boosting machines, random forest, deep neural nets, dimensionality reduction methods (PCA, GLRM), clustering algorithms (K-means), and anomaly detection methods, among others.


Ensemble Methods: Elegant Techniques to Produce Improved Machine Learning Results

#artificialintelligence

Ensemble methods are techniques that create multiple models and then combine them to produce improved results. Ensemble methods usually produce more accurate solutions than a single model would. This has been the case in a number of machine learning competitions, where the winning solutions used ensemble methods. In the popular Netflix Competition, the winner used an ensemble method to implement a powerful collaborative filtering algorithm. Another example is KDD 2009, where the winner also used ensemble methods.
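Two of the simplest ways to combine models are majority voting for classification and averaging for regression. The sketch below shows both on made-up predictions; the function names and the example data are hypothetical, and real ensembles (such as the competition winners mentioned above) typically use more elaborate combination schemes.

```python
from collections import Counter
from statistics import mean

def majority_vote(predictions):
    """Combine class predictions: one list per model, one column per sample."""
    return [Counter(column).most_common(1)[0][0] for column in zip(*predictions)]

def average(predictions):
    """Combine numeric predictions by averaging across models."""
    return [mean(column) for column in zip(*predictions)]

# Hypothetical predictions: each model makes one mistake, on a different sample.
truth   = [1, 0, 1, 1, 0, 1]
model_a = [1, 0, 1, 1, 0, 0]
model_b = [1, 0, 0, 1, 0, 1]
model_c = [0, 0, 1, 1, 0, 1]

combined = majority_vote([model_a, model_b, model_c])
print(combined == truth)  # True
```

Each individual model is right on 5 of the 6 samples, but because their errors fall on different samples the vote recovers every label. The benefit shrinks when the models tend to make the same mistakes.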


Comments on: "A Random Forest Guided Tour" by G. Biau and E. Scornet

arXiv.org Machine Learning

This paper is a comment on the survey paper by Biau and Scornet (2016) about random forests. We focus on the problem of quantifying the impact of each ingredient of random forests on their performance. We show that such a quantification is possible for a simple pure forest, leading to conclusions that could apply more generally. Then, we consider "holdout" random forests, which are a good middle point between "toy" pure forests and Breiman's original random forests. We would like to thank G. Biau and E. Scornet for their clear and thought-provoking survey (Biau and Scornet, 2016).


Walmart and Random Forest

@machinelearnbot

In the recent Walmart Kaggle competition I used a Random Forest classifier to solve a market basket problem. A market basket model is built on the idea that relationships exist between items purchased together. For example, a person purchasing a new toothbrush is more likely to also purchase toothpaste than motor oil on the same shopping trip. Retailers use these market basket relationships in the design of their stores for ease of use and also to increase sales. In this specific problem Walmart has broken up their shopping trips into 38 unique 'TripType' categories.


Installing XGBoost For Anaconda on Windows (IT Best Kept Secret Is Optimization)

#artificialintelligence

XGBoost is a recent implementation of Boosted Trees. It is a machine learning algorithm that has yielded great results in recent Kaggle competitions. I decided to install it on my computers to give it a try. Installation on OS X was straightforward using these instructions. Installation on Windows was not as straightforward.


How to Set Up Distributed XGBoost on MapR-FS

#artificialintelligence

XGBoost is a library designed for boosted (tree) algorithms. It has become a popular machine learning framework among data science practitioners, especially on Kaggle, a platform for data prediction competitions where researchers post their data and statisticians and data miners compete to produce the best models. For structured learning problems on Kaggle, it can be difficult to get into the top 10 without including XGBoost. Typically, data scientists train XGBoost models on a single multi-threaded machine. Very few people have deployed XGBoost in a distributed environment and achieved good performance.