Ensemble Learning
LightGBM: A Highly Efficient Gradient Boosting Decision Tree
Ke, Guolin, Meng, Qi, Finley, Thomas, Wang, Taifeng, Chen, Wei, Ma, Weidong, Ye, Qiwei, Liu, Tie-Yan
Gradient Boosting Decision Tree (GBDT) is a popular machine learning algorithm, and has quite a few effective implementations such as XGBoost and pGBRT. Although many engineering optimizations have been adopted in these implementations, the efficiency and scalability are still unsatisfactory when the feature dimension is high and data size is large. A major reason is that for each feature, they need to scan all the data instances to estimate the information gain of all possible split points, which is very time consuming. To tackle this problem, we propose two novel techniques: Gradient-based One-Side Sampling (GOSS) and Exclusive Feature Bundling (EFB). With GOSS, we exclude a significant proportion of data instances with small gradients, and only use the rest to estimate the information gain. We prove that, since the data instances with larger gradients play a more important role in the computation of information gain, GOSS can obtain quite accurate estimation of the information gain with a much smaller data size. With EFB, we bundle mutually exclusive features (i.e., they rarely take nonzero values simultaneously), to reduce the number of features. We prove that finding the optimal bundling of exclusive features is NP-hard, but a greedy algorithm can achieve a quite good approximation ratio (and thus can effectively reduce the number of features without hurting the accuracy of split point determination by much). We call our new GBDT implementation with GOSS and EFB LightGBM. Our experiments on multiple public datasets show that LightGBM speeds up the training process of conventional GBDT by up to over 20 times while achieving almost the same accuracy.
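The GOSS step described in the abstract above can be sketched in a few lines of Python. This is a minimal illustration, not LightGBM's actual implementation: the function name goss_sample and its seed argument are hypothetical, and the ratios a and b are assumed to follow the paper's description (keep the top a fraction of instances by absolute gradient, then randomly draw a b fraction of all instances from the remainder). The sampled small-gradient instances are up-weighted by (1 - a) / b so that the gain estimate stays roughly unbiased.

```python
import random


def goss_sample(gradients, a=0.2, b=0.1, seed=None):
    """Return (indices, weights) chosen by a GOSS-style one-side sampling.

    a: fraction of instances kept because their |gradient| is largest.
    b: fraction of instances drawn at random from the remaining ones.
    Both names are illustrative; this is a sketch, not LightGBM's code.
    """
    if not (0 < a < 1 and 0 < b <= 1 - a):
        raise ValueError("need 0 < a < 1 and 0 < b <= 1 - a")
    rng = random.Random(seed)
    n = len(gradients)
    top_n = int(a * n)
    rand_n = int(b * n)

    # Sort instance indices by absolute gradient, largest first.
    order = sorted(range(n), key=lambda i: abs(gradients[i]), reverse=True)
    top_idx = order[:top_n]

    # Randomly sample from the small-gradient remainder, without replacement.
    rand_idx = rng.sample(order[top_n:], min(rand_n, n - top_n))

    # Large-gradient instances keep weight 1; sampled small-gradient
    # instances are amplified to compensate for the ones left out.
    weights = [1.0] * len(top_idx) + [(1.0 - a) / b] * len(rand_idx)
    return top_idx + rand_idx, weights


if __name__ == "__main__":
    grads = [float(i) for i in range(100)]
    idx, w = goss_sample(grads, a=0.2, b=0.1, seed=0)
    print(len(idx), w[0], w[-1])
```

With these settings only 30% of the instances are used to evaluate split points, which is where the speed-up comes from; the information gain would then be computed from the weighted gradients of the selected instances.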
Cost efficient gradient boosting
Peter, Sven, Diego, Ferran, Hamprecht, Fred A., Nadler, Boaz
Many applications require learning classifiers or regressors that are both accurate and cheap to evaluate. Prediction cost can be drastically reduced if the learned predictor is constructed such that on the majority of the inputs, it uses cheap features and fast evaluations. The main challenge is to do so with little loss in accuracy. In this work we propose a budget-aware strategy based on deep boosted regression trees. In contrast to previous approaches to learning with cost penalties, our method can grow very deep trees that on average are nonetheless cheap to compute. We evaluate our method on a number of datasets and find that it outperforms the current state of the art by a large margin. Our algorithm is easy to implement and its learning time is comparable to that of the original gradient boosting.
Random Forest in Python – William Koehrsen – Medium
There has never been a better time to get into machine learning. With the learning resources available online, free open-source tools with implementations of any algorithm imaginable, and the cheap availability of computing power through cloud services such as AWS, machine learning is truly a field that has been democratized by the internet. Anyone with access to a laptop and a willingness to learn can try out state-of-the-art algorithms in minutes. With a little more time, you can develop practical models to help in your daily life or at work (or better yet, switch into the machine learning field and reap the economic benefits). This post will walk you through an end-to-end implementation of the powerful random forest machine learning model. It is meant to serve as a complement to my conceptual explanation of the random forest, but can be read entirely on its own as long as you have the basic idea of a decision tree and a random forest. There will of course be Python code here. However, it is not meant to intimidate anyone, but rather to show how accessible machine learning is with the resources available today!
H2O4GPU Hands-On Lab (Video) Updates - H2O.ai Blog
Deep learning algorithms have benefited significantly from the recent performance gains of GPUs. However, it has been uncertain whether GPUs can speed up powerful classical machine learning algorithms such as generalized linear modeling, random forests, gradient boosting machines, clustering, and singular value decomposition. Today I'd love to share another interesting presentation from #H2OWorld focused on H2O4GPU. H2O4GPU is a GPU-optimized machine learning library with a Python scikit-learn API tailored for enterprise AI. The library includes all the CPU algorithms from scikit-learn and also has selected algorithms that benefit greatly from GPU acceleration. In the video below, Jon McKinney, Director of Research at H2O.ai, discussed the GPU-optimized machine learning algorithms in H2O4GPU and showed their speed in a suite of benchmarks against scikit-learn run on CPUs.
How to Win a Data Science Competition: Learn from Top Kagglers Coursera
About this course: If you want to break into competitive data science, then this course is for you! Participating in predictive modelling competitions can help you gain practical experience and improve and harness your data modelling skills in various domains such as credit, insurance, marketing, natural language processing, sales forecasting and computer vision, to name a few. At the same time you get to do it in a competitive context against thousands of participants, where each one tries to build the most predictive algorithm. Pushing each other to the limit can result in better performance and smaller prediction errors. Being able to achieve high ranks consistently can help you accelerate your career in data science.
Practical Tutorial on Random Forest and Parameter Tuning in R Tutorials & Notes Machine Learning HackerEarth
Random Forest is one of the most versatile machine learning algorithms available today. With its built-in ensembling capacity, the task of building a decent generalized model (on any dataset) gets much easier. However, I've seen people using random forest as a black box model; i.e., they don't understand what's happening beneath the code. In fact, the easiest part of machine learning is coding. If you are new to machine learning, the random forest algorithm should be at your fingertips.
Introduction to Random Forests
Let's load the data into a Pandas dataframe using urlopen from the urllib.request module. Instead of downloading a csv, I grabbed the data straight from the UCI Machine Learning Database using an http request, a method inspired by Python tutorials from the University of California, Santa Barbara's data science course. I recommend that you keep a static file for your data set as well. Now, create a list with the appropriate names and set them as the data frame's column names. You'll need to do some minor cleaning, such as setting the id_number to the data frame index and converting the diagnosis to the standard binary 1, 0 representation using the map() function.
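The cleaning steps described above might look roughly like the sketch below. It is an illustration under stated assumptions, not the tutorial's own code: the helper load_and_clean, the feature columns radius_mean and texture_mean, the sample rows, and the "M"/"B" diagnosis labels are all hypothetical placeholders. To stay runnable offline it reads from an in-memory string; the object returned by urlopen(...) could be passed to the same function in its place.

```python
import io

import pandas as pd

# Made-up rows standing in for the remote file (no header line).
SAMPLE_CSV = "101,M,17.9,10.4\n102,M,20.6,17.8\n103,B,13.5,14.4\n"

# Column names to attach; only id_number and diagnosis come from the text.
NAMES = ["id_number", "diagnosis", "radius_mean", "texture_mean"]


def load_and_clean(source, names):
    """Read a headerless CSV, name its columns, and apply minor cleaning."""
    df = pd.read_csv(source, header=None, names=names)
    # Use the identifier as the index rather than as a feature.
    df = df.set_index("id_number")
    # Convert the label to a binary 1/0 representation with map().
    df["diagnosis"] = df["diagnosis"].map({"M": 1, "B": 0})
    return df


df = load_and_clean(io.StringIO(SAMPLE_CSV), NAMES)
print(df.head())
```

Any label outside the mapping dictionary becomes NaN after map(), so checking df["diagnosis"].isna().sum() afterwards is a cheap way to catch unexpected values.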
Top Data Science and Machine Learning Methods Used
The average respondent used 7.7 tools/methods, similar to the 2016 poll. Next, we compared the top 16 methods in this year's poll with their share last year - see Figure 1. We note a significant increase in the usage share of Random Forests, Visualization, and Deep Learning, and a decline in K-nn, PCA, and Boosting. Gradient Boosting Machines was a new entry in 2017. Deep Learning, despite its amazing successes, is reportedly used by only about 20% of KDnuggets readers.
Regression prediction intervals with XGBOOST
Knowledge of the uncertainty in an algorithm's predictions is paramount for anyone who wishes to do serious predictive analytics for their business. Predictions are never absolute, and it is imperative to know the potential variations. If one wishes to know the passenger volume for each flight, one also needs to know by how many passengers the prediction may differ. Another could decide to predict disembarking times. There is of course a difference between a prediction on a scale of a few hours that has a 95% chance of being correct to within half an hour, and one with a potential error of 10 hours!
Gradient Boosting from scratch – ML Review – Medium
Although most Kaggle competition winners use a stack or ensemble of various models, one particular model that is part of most of these ensembles is some variant of the Gradient Boosting (GBM) algorithm. Take, for example, the winner of the latest Kaggle competition: Michael Jahrer's solution with representation learning in Safe Driver Prediction. His solution was a blend of 6 models: 1 LightGBM (a variant of GBM) and 5 Neural Nets. Although his success is attributed to the new semi-supervised learning that he invented for the structured data, the gradient boosting model has done its useful part too. Even though GBM is widely used, many practitioners still treat it as a complex black-box algorithm and just run the models using pre-built libraries.