 Ensemble Learning


Tree Ensembles with Rule Structured Horseshoe Regularization

arXiv.org Machine Learning

We propose a new Bayesian model for flexible nonlinear regression and classification using tree ensembles. The model is based on the RuleFit approach in Friedman and Popescu (2008), where rules from decision trees and linear terms are used in an L1-regularized regression. We modify RuleFit by replacing the L1-regularization with a horseshoe prior, which is well known to give aggressive shrinkage of noise predictors while leaving the important signal essentially untouched. This is especially important when a large number of rules are used as predictors, as many of them only contribute noise. Our horseshoe prior has an additional hierarchical layer that applies more shrinkage a priori to rules with a large number of splits, and to rules that are only satisfied by a few observations. The aggressive noise shrinkage of our prior also makes it possible to complement the rules from boosting in Friedman and Popescu (2008) with an additional set of trees from random forest, which brings a desirable diversity to the ensemble. We sample from the posterior distribution using a very efficient and easily implemented Gibbs sampler. The new model is shown to outperform state-of-the-art methods like RuleFit, BART and random forest on 16 datasets. The model and its interpretation are demonstrated on the well-known Boston housing data, and on gene expression data for cancer classification. The posterior sampling, prediction and graphical tools for interpreting the model results are implemented in a publicly available R package.
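
As a rough illustration of the rule predictors the abstract refers to, the sketch below turns decision rules into binary features and computes the two quantities the prior is said to depend on: the number of splits in a rule and the share of observations that satisfy it. The feature names, thresholds and data rows are hypothetical, and the exact form of the paper's prior is not reproduced here.

```python
import operator

# Comparison operators allowed in a rule condition.
OPS = {"<=": operator.le, ">": operator.gt}

def rule_feature(rule, row):
    """Return 1 if the row satisfies every condition of the rule, else 0."""
    return int(all(OPS[op](row[name], threshold) for name, op, threshold in rule))

# Hypothetical rules, each a conjunction of (feature, operator, threshold) splits.
rules = [
    [("rooms", ">", 6)],
    [("rooms", ">", 6), ("crime", "<=", 1.0)],
]

# Hypothetical observations.
rows = [
    {"rooms": 7, "crime": 0.5},
    {"rooms": 5, "crime": 0.2},
    {"rooms": 8, "crime": 3.0},
    {"rooms": 6.5, "crime": 0.9},
]

# Binary design matrix: one column per rule, one row per observation.
design = [[rule_feature(rule, row) for rule in rules] for row in rows]

# Support: fraction of observations satisfying each rule.
support = [sum(column) / len(rows) for column in zip(*design)]

# Length: number of splits in each rule.
lengths = [len(rule) for rule in rules]
```

In a RuleFit-style model these binary columns would sit next to the linear terms in a regularized regression. A longer rule or one with very low support would, per the abstract, receive more shrinkage a priori.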


Ensemble Machine Learning in Python: Random Forest, AdaBoost

@machinelearnbot

In recent years, we've seen a resurgence in AI, or artificial intelligence, and machine learning. Machine learning has led to some amazing results, like being able to analyze medical images and predict diseases on par with human experts. Google's AlphaGo program was able to beat a world champion in the strategy game Go using deep reinforcement learning. Machine learning is even being used to program self-driving cars, which is going to change the automotive industry forever. Imagine a world with drastically reduced car accidents, simply by removing the element of human error.


Predicting short-term Bitcoin price fluctuations from buy and sell orders

arXiv.org Machine Learning

Bitcoin is the first decentralized digital cryptocurrency, which has shown significant market capitalization growth in the last few years. It is important to understand what drives the fluctuations of the Bitcoin exchange price and to what extent they are predictable. In this paper, we study the ability to make short-term predictions of the exchange price fluctuations (measured with volatility) against the United States dollar. We use the data of buy and sell orders collected from one of the largest Bitcoin digital trading offices in 2016 and 2017. We construct a generative temporal mixture model of the volatility and trade order book data, which is able to outperform the current state-of-the-art machine learning and time-series statistical models. With the gate weighting function of our generative temporal mixture model, we are able to detect regimes when the features of buy and sell orders significantly affect the future high volatility periods. Furthermore, we provide insights into the dynamical importance of specific features from the order book such as market spread, depth, volume and ask/bid slope to explain future short-term price fluctuations.


Consistent Individualized Feature Attribution for Tree Ensembles

arXiv.org Machine Learning

Interpreting predictions from tree ensemble methods such as gradient boosting machines and random forests is important, yet feature attribution for trees is often heuristic and not individualized for each prediction. Here we show that popular feature attribution methods are inconsistent, meaning they can lower a feature's assigned importance when the true impact of that feature actually increases. This is a fundamental problem that casts doubt on any comparison between features. To address it we turn to recent applications of game theory and develop fast exact tree solutions for SHAP (SHapley Additive exPlanation) values, which are the unique consistent and locally accurate attribution values. We then extend SHAP values to interaction effects and define SHAP interaction values. We propose a rich visualization of individualized feature attributions that improves over classic attribution summaries and partial dependence plots, and a unique "supervised" clustering (clustering based on feature attributions). We demonstrate better agreement with human intuition through a user study, exponential improvements in run time, improved clustering performance, and better identification of influential features. An implementation of our algorithm has also been merged into XGBoost and LightGBM; see http://github.com/slundberg/shap for details.
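
To make the Shapley idea behind SHAP concrete, here is a minimal brute-force sketch that enumerates every coalition of features. It is not the fast tree algorithm from the paper and its cost grows exponentially with the number of features. The toy model, the input and the choice to replace "missing" features with a fixed baseline are all assumptions made for illustration.

```python
from itertools import combinations
from math import factorial

def shapley_values(value_fn, n_features):
    """Exact Shapley values by enumerating all coalitions (exponential cost)."""
    phi = [0.0] * n_features
    for i in range(n_features):
        others = [j for j in range(n_features) if j != i]
        for size in range(len(others) + 1):
            # Weight of a coalition of this size in the Shapley average.
            weight = (factorial(size) * factorial(n_features - size - 1)
                      / factorial(n_features))
            for subset in combinations(others, size):
                coalition = frozenset(subset)
                # Marginal contribution of feature i when added to the coalition.
                phi[i] += weight * (value_fn(coalition | {i}) - value_fn(coalition))
    return phi

def model(x):
    # Hypothetical model: one additive term and one interaction term.
    return 2 * x[0] + x[1] * x[2]

x = [1.0, 2.0, 3.0]         # instance to explain (hypothetical)
baseline = [0.0, 0.0, 0.0]  # stand-in values for "missing" features (assumption)

def value_fn(coalition):
    # Features in the coalition keep their value; the rest fall back to the baseline.
    mixed = [x[j] if j in coalition else baseline[j] for j in range(len(x))]
    return model(mixed)

phi = shapley_values(value_fn, len(x))
print(phi)
```

For this toy model the attributions come out close to 2, 3 and 3. Their sum matches the difference between the prediction for x and the prediction for the baseline, which is the local accuracy property mentioned in the abstract.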


Predicting Customer Churn: Extreme Gradient Boosting with Temporal Data

arXiv.org Machine Learning

Accurately predicting customer churn using large-scale time-series data is a common problem facing many business domains. The creation of model features across various time windows for training and testing can be particularly challenging due to temporal issues common to time-series data. In this paper, we explore the application of extreme gradient boosting (XGBoost) on a customer dataset with a wide variety of temporal features in order to create a highly accurate customer churn model. In particular, we describe an effective method for handling temporally sensitive feature engineering. The proposed model was submitted to the WSDM Cup 2018 Churn Challenge and achieved first place out of 575 teams.
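
The abstract does not spell out the authors' feature engineering method, so the following is only a generic sketch of one way to keep temporal features free of leakage: features are computed strictly from events before a cutoff date, and the churn label strictly from events after it. The event layout, function names and window lengths are hypothetical.

```python
from datetime import date, timedelta

def window_features(events, cutoff, window_days):
    """Count and sum events per customer in [cutoff - window_days, cutoff)."""
    start = cutoff - timedelta(days=window_days)
    features = {}
    for customer, day, amount in events:
        if start <= day < cutoff:  # only look backwards from the cutoff
            count, total = features.get(customer, (0, 0.0))
            features[customer] = (count + 1, total + amount)
    return features

def churn_labels(events, customers, cutoff, horizon_days):
    """Label a customer as churned if they have no event in [cutoff, cutoff + horizon_days)."""
    end = cutoff + timedelta(days=horizon_days)
    active = {customer for customer, day, _ in events if cutoff <= day < end}
    return {customer: customer not in active for customer in customers}

# Hypothetical transaction log: (customer id, event date, amount).
events = [
    ("a", date(2017, 1, 5), 10.0),
    ("a", date(2017, 1, 20), 5.0),
    ("a", date(2017, 2, 10), 7.0),
    ("b", date(2017, 1, 25), 3.0),
]

cutoff = date(2017, 2, 1)
features = window_features(events, cutoff, window_days=30)
labels = churn_labels(events, ["a", "b"], cutoff, horizon_days=30)
```

Sliding the cutoff forward produces further training and test snapshots, each built with the same rule that nothing dated on or after the cutoff may enter the features.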


Predicting University Students' Academic Success and Choice of Major using Random Forests

arXiv.org Machine Learning

Cédric Beaulac, Jeffrey S. Rosenthal. August 31, 2017.

Abstract: In this paper, a large data set containing every course taken by every undergraduate student in a major university in Canada over 10 years is analyzed. Modern machine learning algorithms can use large data sets to build useful tools for the data provider, in this case, the university. In this article, two classifiers are constructed using random forests. First, the first two semesters of courses completed by a student are used to predict whether they will obtain an undergraduate degree. Second, for the students who completed a program, their choice of major is predicted, again using the first few courses they registered for. A classification tree is an intuitive and powerful classifier, and building a random forest of trees lowers the variance of the classifier and also prevents overfitting. Random forests also allow for reliable variable importance measurements. These measures explain which variables are useful to both classifiers and can be used to better understand what is statistically related to the students' choices. The results are two accurate classifiers and a variable importance analysis that provides useful information to the university.

Keywords: Higher Education, Students' Success and Choice, Machine Learning, Classification Tree, Random Forest, Variable Importance

1 Introduction: As the demand for qualified labour increases, it becomes more and more important to understand what motivates students to complete their program and how they select their majors. In parallel, universities are continuously trying to improve their programs and attract more students. It would be useful for a university to be able to predict whether or not a student who begins a program will complete it.


Identifying churn drivers with Random Forests – Slav

#artificialintelligence

At RetainKit, we aim to tackle the challenging problem of churn at SaaS companies by using AI and machine learning. If you run a SaaS company and you have churn issues, we'd be happy to talk to you and see if our product could help. You can also follow us on Product Hunt Upcoming. In the early days of Post Planner (my previous startup), everything was going fine, except that it wasn't. We had built a product that solved a problem, or so we thought.


Deploy Machine Learning Models from R Research to Ruby / Go Production with PMML

@machinelearnbot

Deploying models trained in your research environment is not always a simple task. Your research environment, your production programming language, and the interplay between them may affect the ease of introducing new statistical models in production. In this blog post, I'll demonstrate the complete flow: training a Random Forest model in R, exporting it to a PMML file, and finally scoring with the model in production-oriented languages using Scoruby and Goscore. PMML stands for Predictive Model Markup Language and can represent models from research environments as XML files, which can later be loaded and run in production. Scoruby and Goscore are code packages I wrote that consume PMML files of various models and execute them in Ruby and Go, respectively, under production memory and speed constraints.
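
To give a feel for what scoring from a PMML file involves, here is a minimal sketch that walks a single decision tree stored as XML. It is written in Python with the standard library only and does not reflect the Scoruby or Goscore APIs. The fragment is hand-written and simplified: a real export carries an XML namespace, a header, a data dictionary and, for a random forest, many such trees combined by voting.

```python
import xml.etree.ElementTree as ET

# Simplified, hand-written tree fragment (assumption: no namespace, one tree only).
PMML_FRAGMENT = """
<TreeModel functionName="classification">
  <Node score="no">
    <True/>
    <Node score="no">
      <SimplePredicate field="age" operator="lessOrEqual" value="30"/>
    </Node>
    <Node score="yes">
      <SimplePredicate field="age" operator="greaterThan" value="30"/>
    </Node>
  </Node>
</TreeModel>
"""

# Numeric comparison operators used by SimplePredicate elements in this sketch.
OPERATORS = {
    "lessThan": lambda a, b: a < b,
    "lessOrEqual": lambda a, b: a <= b,
    "greaterThan": lambda a, b: a > b,
    "greaterOrEqual": lambda a, b: a >= b,
}

def matches(node, features):
    """Check whether the node's predicate holds for the given feature values."""
    if node.find("True") is not None:
        return True
    predicate = node.find("SimplePredicate")
    if predicate is None:
        return False
    compare = OPERATORS[predicate.get("operator")]
    return compare(features[predicate.get("field")], float(predicate.get("value")))

def score(node, features):
    """Descend into the first matching child; return the score of the last node reached."""
    for child in node.findall("Node"):
        if matches(child, features):
            return score(child, features)
    return node.get("score")

root = ET.fromstring(PMML_FRAGMENT).find("Node")
young = score(root, {"age": 25})
older = score(root, {"age": 45})
```

A forest scorer would repeat this walk for every tree in the file and aggregate the results, for example by majority vote for classification.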


Relevant Ensemble of Trees

arXiv.org Machine Learning

Tree ensembles are flexible predictive models that can capture relevant variables and, to some extent, their interactions in a compact and interpretable manner. Most algorithms for obtaining tree ensembles are based on versions of boosting or Random Forest. Previous work showed that boosting algorithms exhibit a cyclic behavior of selecting the same tree again and again due to the way the loss is optimized. At the same time, Random Forest is not based on loss optimization and obtains a more complex and less interpretable model. In this paper we present a novel method for obtaining compact tree ensembles by growing a large pool of trees in parallel with many independent boosting threads and then selecting a small subset and updating their leaf weights by loss optimization. We allow the trees in the initial pool to have different depths, which further helps with generalization. Experiments on real datasets show that the obtained model usually has a smaller loss than boosting, which is also reflected in a lower misclassification error on the test set.


How to find the contributing features of each tree in Random Forest Classifier in Python

@machinelearnbot

To do this you need access to the tree structure of the random forest, which you have with the classifier. If you find the "gain" associated with each variable along the path to a leaf, then you can calculate the contribution for each leaf. I cannot help more with random forests (I am more versed in boosting).
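
One common way to act on this advice is a path-based decomposition: follow the sample down the tree and credit each change in node value to the feature that was split on. The sketch below uses node values rather than the "gain" mentioned above, and a hand-written tree of nested dictionaries with hypothetical feature names. With a fitted scikit-learn forest, the same information would have to be read from each estimator's tree structure instead.

```python
def path_contributions(tree, x):
    """Follow x down the tree and credit each change in node value to the split feature."""
    contributions = {}
    node = tree
    while "feature" in node:  # internal nodes carry a split; leaves do not
        feature = node["feature"]
        child = node["left"] if x[feature] <= node["threshold"] else node["right"]
        contributions[feature] = (contributions.get(feature, 0.0)
                                  + child["value"] - node["value"])
        node = child
    # Root value acts as the bias; the leaf value is the tree's prediction.
    return tree["value"], contributions, node["value"]

# Hypothetical tree: "value" is the mean training target of the samples at that node.
tree = {
    "value": 0.5, "feature": "age", "threshold": 30,
    "left": {"value": 0.2},
    "right": {
        "value": 0.8, "feature": "income", "threshold": 50,
        "left": {"value": 0.6},
        "right": {"value": 0.9},
    },
}

bias, contributions, prediction = path_contributions(tree, {"age": 40, "income": 60})
print(bias, contributions, prediction)
```

The bias plus the sum of the contributions equals the tree's prediction. For a whole forest, the per-tree contributions can be averaged across trees.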