 Ensemble Learning


An Information-Gain-based Feature Ranking Function for XGBoost

#artificialintelligence

XGBoost (short for Extreme Gradient Boosting) is a relatively new classification technique in machine learning that has become increasingly popular because of its exceptional performance in multiple competitions hosted on Kaggle.com. A lesser-known benefit of XGBoost is that the tree ensemble model can rank features for high-dimensional data sets. The official Python implementation of XGBoost provides only one feature scoring function, get_fscore. It computes feature scores by counting how many times a feature appears in the splits and ranks the features by those counts. It is simple and straightforward, but I believe we should not ignore another metric that is critical to the decision tree method.
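To illustrate why a split count and a gain-based score can disagree, here is a minimal sketch in plain Python. It does not call XGBoost: the split records, feature names, and gain values are made up for illustration, and in practice they would have to be extracted from a trained model.

```python
from collections import defaultdict

# Hypothetical split records: (feature name, gain of that split).
# These values are invented; they stand in for what a tree dump might contain.
splits = [
    ("age", 0.9),
    ("income", 0.2),
    ("income", 0.1),
    ("income", 0.1),
    ("zip", 0.05),
]

def rank_by_count(records):
    """Score each feature by how many splits use it (split-count ranking)."""
    counts = defaultdict(int)
    for feature, _gain in records:
        counts[feature] += 1
    return sorted(counts.items(), key=lambda kv: kv[1], reverse=True)

def rank_by_gain(records):
    """Score each feature by the total gain summed over its splits."""
    totals = defaultdict(float)
    for feature, gain in records:
        totals[feature] += gain
    return sorted(totals.items(), key=lambda kv: kv[1], reverse=True)

count_ranking = rank_by_count(splits)
gain_ranking = rank_by_gain(splits)
print("by count:", count_ranking)
print("by gain: ", gain_ranking)
```

In this toy data "income" is used most often, so it tops the count ranking, while "age" appears only once but with a large gain, so it tops the gain ranking.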


Pruning Random Forests for Prediction on a Budget

arXiv.org Machine Learning

We propose to prune a random forest (RF) for resource-constrained prediction. We first construct a RF and then prune it to optimize expected feature cost & accuracy. We pose pruning RFs as a novel 0-1 integer program with linear constraints that encourages feature re-use. We establish total unimodularity of the constraint set to prove that the corresponding LP relaxation solves the original integer program. We then exploit connections to combinatorial optimization and develop an efficient primal-dual algorithm, scalable to large datasets. In contrast to our bottom-up approach, which benefits from good RF initialization, conventional methods are top-down, acquiring features based on their utility value, and are generally intractable, requiring heuristics. Empirically, our pruning algorithm outperforms existing state-of-the-art resource-constrained algorithms.


Making Tree Ensembles Interpretable

arXiv.org Machine Learning

Tree ensembles, such as random forests and boosted trees, are renowned for their high prediction performance, whereas their interpretability is critically limited. In this paper, we propose a post-processing method that improves the model interpretability of tree ensembles. After learning a complex tree ensemble in a standard way, we approximate it by a simpler model that is interpretable to humans. To obtain the simpler model, we derive the EM algorithm minimizing the KL divergence from the complex ensemble. A synthetic experiment showed that a complicated tree ensemble was approximated reasonably as interpretable.


Ensemble Machine Learning Algorithms in Python with scikit-learn - Machine Learning Mastery

#artificialintelligence

Ensembles can give you a boost in accuracy on your dataset. In this post you will discover how you can create some of the most powerful types of ensembles in Python using scikit-learn. This case study will step you through Boosting, Bagging and Majority Voting and show you how you can continue to ratchet up the accuracy of the models on your own datasets. It assumes you are generally familiar with machine learning algorithms and ensemble methods and that you are looking for information on how to create ensembles in Python.
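As a rough illustration of the majority voting idea, the sketch below combines the predictions of three classifiers in plain Python. It is not the scikit-learn code from the post; the function name majority_vote and the prediction lists are hypothetical.

```python
from collections import Counter

def majority_vote(predictions):
    """Combine per-model label lists into one label per sample.

    `predictions` holds one list of labels per model, all the same length.
    Ties go to the label seen first, so an odd number of binary voters
    avoids ties altogether.
    """
    combined = []
    for sample_votes in zip(*predictions):
        label, _count = Counter(sample_votes).most_common(1)[0]
        combined.append(label)
    return combined

# Made-up predictions from three hypothetical models on four samples.
model_a = [1, 0, 1, 1]
model_b = [1, 1, 0, 1]
model_c = [0, 0, 1, 1]

ensemble = majority_vote([model_a, model_b, model_c])
print(ensemble)
```

Each model is wrong on a different sample here, so the vote can recover a label that any single model would have missed.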


XGBoost workshop and meetup talk with Tianqi Chen Data Science Los Angeles

#artificialintelligence

Proof of this and also because XGBoost has an easy-to-use interface from both R and Python, XGBoost has become a favorite tool in Kaggle competitions. Besides feature engineering, cross-validation and ensembling, XGBoost is a key ingredient for achieving the highest accuracy in many data science competitions and, more importantly, in practical applications. We were fortunate to recently host Tianqi Chen, the main author of XGBoost, in a workshop and a meetup talk in Santa Monica, California. First, we started with an advanced workshop in the afternoon, for which anyone could apply to participate, but there were only a dozen spots available (which got us some expert users of XGBoost, but unfortunately we had to reject some good people too, sorry). This advanced workshop had 2 sessions.


XGBoost explained • /r/MachineLearning

#artificialintelligence

To expand: according to my naive understanding, boosted trees are basically an ensemble of decision trees which are fit sequentially so that each new tree makes up for the errors of the previously existing set of trees. The model is "boosted" by focusing new additions on correcting the residual errors of the last version of the model. The "gradient" comes in afterward as the parameters of the tree ensemble are optimized to minimize the error of the whole "base learner". I think of this as fine tuning of the boosted tree ensemble using a gradient-based optimization.
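The sequential fitting described above can be sketched with one-split trees (stumps) on a single input. This is a toy version under stated assumptions, not XGBoost itself: it uses squared error, for which the residuals are the negative gradient of the loss, so fitting each new stump to the residuals is itself the gradient step. The names fit_stump and boost and the data are made up.

```python
def fit_stump(xs, residuals):
    """Fit a one-split regression tree to the residuals on a 1-D input.

    Assumes xs contains at least two distinct values.
    """
    best = None
    for threshold in sorted(set(xs))[:-1]:
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        error = (sum((r - left_mean) ** 2 for r in left)
                 + sum((r - right_mean) ** 2 for r in right))
        if best is None or error < best[0]:
            best = (error, threshold, left_mean, right_mean)
    _, threshold, left_mean, right_mean = best
    return lambda x: left_mean if x <= threshold else right_mean

def boost(xs, ys, rounds=20, learning_rate=0.5):
    """Add stumps one at a time, each fit to the current residuals."""
    base = sum(ys) / len(ys)  # start from the mean prediction
    stumps = []

    def predict(x):
        return base + learning_rate * sum(stump(x) for stump in stumps)

    for _ in range(rounds):
        residuals = [y - predict(x) for x, y in zip(xs, ys)]
        stumps.append(fit_stump(xs, residuals))
    return predict

def mean_squared_error(predict, xs, ys):
    return sum((y - predict(x)) ** 2 for x, y in zip(xs, ys)) / len(xs)

# Made-up data with a jump between x = 3 and x = 4.
xs = [1, 2, 3, 4, 5, 6]
ys = [1.0, 1.2, 0.9, 3.1, 2.9, 3.0]

mean_y = sum(ys) / len(ys)
mse_mean_only = mean_squared_error(lambda x: mean_y, xs, ys)
model = boost(xs, ys)
mse_boosted = mean_squared_error(model, xs, ys)
print(mse_mean_only, mse_boosted)
```

The learning rate shrinks each stump's contribution, so every round removes only part of the remaining error and later trees keep correcting what is left.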


szilard/xgboost-adv-workshop-LA

#artificialintelligence

Tianqi Chen will be in Santa Monica, June 2, 2016 and besides a meetup talk in the evening (already sold out, sorry) I'm also organizing an advanced workshop in the afternoon (3:00-6:00pm) to do more advanced stuff. There will be only 10 spots for the workshop and you'll have to apply by filling out this form (Update: workshop is full.). The workshop will be a mix of Tianqi talking about more advanced stuff and participants interacting, asking questions etc. (partly hands-on, bring your laptop and your specific questions/problems/datasets). We can use this github repo (issues, PR) for setting up questions/problems/topics etc. to be discussed in the workshop, feel free to participate. Location disclosed only to the selected participants.


How to use XGBoost algorithm in R in easy steps

#artificialintelligence

Did you know that using the XGBoost algorithm is one of the popular winning recipes in data science competitions? So, what makes it more powerful than a traditional Random Forest or Neural Network? In the last few years, predictive modeling has become much faster and more accurate. I remember spending long hours on feature engineering to improve a model by a few decimals. A lot of that difficult work can now be done by using better algorithms.


Random forest - impute or remove NA values? Which is the better approach? • /r/MachineLearning

@machinelearnbot

Can you reduce the parameter space at all (using PCA or something similar)? This would probably improve your results when removing the NAs. Are the NA values present in every dimension? If there are only a couple of dimensions with NAs, try to train without them and see what happens. What does your data represent, and why are there NAs? Depending on what your data corresponds to it may make more or less sense to use imputation.
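To make the two options concrete, here is a small plain-Python sketch that either removes rows containing NAs or fills each NA with its column mean. The data and function names are made up, None stands in for NA, and mean imputation is only one of several possible strategies.

```python
# Made-up data: four rows, three features, None marks a missing value.
rows = [
    [1.0, None, 3.0],
    [2.0, 5.0, 1.0],
    [4.0, 7.0, None],
    [3.0, 6.0, 2.0],
]

def drop_incomplete_rows(data):
    """Keep only the rows that have no missing values."""
    return [row for row in data if None not in row]

def impute_column_means(data):
    """Replace each missing value with the mean of its column.

    Assumes every column has at least one observed value.
    """
    means = []
    for column in zip(*data):
        present = [value for value in column if value is not None]
        means.append(sum(present) / len(present))
    return [
        [means[j] if value is None else value for j, value in enumerate(row)]
        for row in data
    ]

dropped = drop_incomplete_rows(rows)
imputed = impute_column_means(rows)
print(len(dropped), "rows kept after removal")
print(imputed)
```

Removal halves this toy data set, while imputation keeps every row at the cost of inventing values, which is the trade-off the questions above are probing.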


Want to Win at Kaggle? Pay Attention to Your Ensembles.

#artificialintelligence

Summary: Want to win a Kaggle competition or at least get a respectable place on the leaderboard? These days it's all about ensembles, and for a lot of practitioners that means reaching for random forests. Random forests have indeed been very successful, but it's worth remembering that there are three different categories of ensembles and some important hyperparameter tuning issues within each. Here's a brief review. The Kaggle competitions are like formula racing for data science. Winners edge out competitors at the fourth decimal place and, like Formula 1 race cars, not many of us would mistake them for daily drivers.