Ensemble Learning
Bootstrap aggregating - Wikipedia, the free encyclopedia
Bootstrap aggregating, also called bagging, is a machine learning ensemble meta-algorithm designed to improve the stability and accuracy of machine learning algorithms used in statistical classification and regression. It also reduces variance and helps to avoid overfitting. Although it is usually applied to decision tree methods, it can be used with any type of method. Bagging is a special case of the model averaging approach. Bagging (Bootstrap aggregating) was proposed by Leo Breiman in 1994 to improve the classification by combining classifications of randomly generated training sets.
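The resample-and-average idea described above can be sketched in a few lines of plain Python. This is a minimal illustration, not a reference implementation: the base learner (a one-split regression stump on 1-D data) and the names `fit_stump`, `bagging_fit` and `bagging_predict` are assumptions chosen for the example.

```python
import random
from statistics import mean

def fit_stump(xs, ys):
    """Fit a one-split regression stump on 1-D data; returns a predict function."""
    pairs = sorted(zip(xs, ys))
    best = None
    for i in range(1, len(pairs)):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # no threshold can separate equal x values
        left = [y for _, y in pairs[:i]]
        right = [y for _, y in pairs[i:]]
        lm, rm = mean(left), mean(right)
        sse = sum((y - lm) ** 2 for y in left) + sum((y - rm) ** 2 for y in right)
        if best is None or sse < best[0]:
            threshold = (pairs[i - 1][0] + pairs[i][0]) / 2
            best = (sse, threshold, lm, rm)
    if best is None:  # every x is identical: fall back to a constant
        constant = mean(ys)
        return lambda x: constant
    _, threshold, lm, rm = best
    return lambda x: lm if x < threshold else rm

def bagging_fit(xs, ys, n_models=25, seed=0):
    """Train n_models stumps, each on a bootstrap sample drawn with replacement."""
    rng = random.Random(seed)
    n = len(xs)
    models = []
    for _ in range(n_models):
        idx = [rng.randrange(n) for _ in range(n)]
        models.append(fit_stump([xs[i] for i in idx], [ys[i] for i in idx]))
    return models

def bagging_predict(models, x):
    """Aggregate by averaging the individual predictions."""
    return mean(m(x) for m in models)
```

Averaging is the aggregation rule for regression; for classification the usual choice would be a majority vote over the models.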
Why are Gradient Boosting Models poor at making predictions? • /r/MachineLearning
Your question is confusing, because in machine learning we call any output of the model a prediction. But I gather you are talking about extrapolation. As in you have trained on values for t < 10 and want to know the future value at t > 10. If you want to use GBM or RF, all you need to do is to make sure your input features are in the same range in training as in testing. So you won't be able to have a year feature where the training data has values in (1970,2010) and predict for 2016. But an input dimension that encodes the weekday would work (if the weekday is meaningful for your predictions), because you are dealing with the same Mon-Sun range in training and testing.
FastBDT: GBDT C++/Python Library (code and paper). Claims fit speed superior to XGBoost • /r/MachineLearning
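The point about feature ranges can be checked with a toy regression tree. The sketch below is an illustration under assumed data (a made-up linear trend over the years 1970 to 2010) and a hypothetical helper `fit_tree`. Every split threshold lies between two training values, so any input beyond the training range falls into the same outermost leaf.

```python
from statistics import mean

def fit_tree(xs, ys, depth=3):
    """Fit a tiny 1-D regression tree; returns a predict function."""
    pairs = sorted(zip(xs, ys))
    best = None
    if depth > 0:
        for i in range(1, len(pairs)):
            if pairs[i - 1][0] == pairs[i][0]:
                continue  # cannot split between equal x values
            left = [y for _, y in pairs[:i]]
            right = [y for _, y in pairs[i:]]
            lm, rm = mean(left), mean(right)
            sse = sum((y - lm) ** 2 for y in left) + sum((y - rm) ** 2 for y in right)
            if best is None or sse < best[0]:
                best = (sse, i)
    if best is None:  # leaf: predict the mean of the targets that reached it
        constant = mean(y for _, y in pairs)
        return lambda x: constant
    i = best[1]
    threshold = (pairs[i - 1][0] + pairs[i][0]) / 2
    lower = fit_tree([x for x, _ in pairs[:i]], [y for _, y in pairs[:i]], depth - 1)
    upper = fit_tree([x for x, _ in pairs[i:]], [y for _, y in pairs[i:]], depth - 1)
    return lambda x: lower(x) if x < threshold else upper(x)

# Hypothetical data: the target grows by one unit per year.
years = list(range(1970, 2011))
values = [year - 1970 for year in years]
tree = fit_tree(years, values, depth=4)

# 2016 lands in the same leaf as 2010, so the trend is not continued.
print(tree(2010), tree(2016))
```

A weekday feature does not have this problem, because the values seen at test time (Mon-Sun) are the same ones the tree was split on.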
This paper presents a speed-optimized and cache-friendly implementation for multivariate classification called FastBDT. FastBDT is one order of magnitude faster during the fitting-phase and application-phase, in comparison with popular implementations in software frameworks like TMVA, scikit-learn and XGBoost. The concepts used to optimize the execution time and performance studies are discussed in detail in this paper. The key ideas include: An equal-frequency binning on the input data, which allows replacing expensive floating-point with integer operations, while at the same time increasing the quality of the classification; a cache-friendly linear access pattern to the input data, in contrast to usual implementations, which exhibit a random access pattern.
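Equal-frequency binning, the first idea listed above, can be sketched generically as follows. This is not FastBDT's code; the function names and the sample values are assumptions for illustration. Bin edges are taken from the sorted data so that each bin receives roughly the same number of values, and every float is then replaced by a small integer bin index.

```python
from bisect import bisect_right

def equal_frequency_edges(values, n_bins):
    """Return n_bins - 1 interior edges taken at evenly spaced ranks of the sorted data."""
    ordered = sorted(values)
    n = len(ordered)
    return [ordered[(k * n) // n_bins] for k in range(1, n_bins)]

def to_bin_index(value, edges):
    """Map a float to its integer bin: the number of edges less than or equal to it."""
    return bisect_right(edges, value)

# Hypothetical feature column with a heavy right tail.
feature = [0.1, 0.2, 0.3, 0.4, 10.0, 20.0, 30.0, 1000.0]
edges = equal_frequency_edges(feature, n_bins=4)
binned = [to_bin_index(v, edges) for v in feature]
print(binned)  # [0, 0, 1, 1, 2, 2, 3, 3]
```

Because the edges follow the ranks rather than the raw magnitudes, the outlier at 1000.0 shares a bin with 30.0 instead of stretching the scale, and later split searches only need to compare integers.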
Why isn't XGBoost a more popular research topic? • /r/MachineLearning
These are well understood models. They are gradient boosted trees with some optimizations. The theory has been around for decades, so there is not much to be uncovered; the library is actually an application of established research. Also, there is a lull in new research ideas related to trees. Perhaps you can come up with some new techniques to improve trees further, and welcome in a new era of tree-based model research?
Calibrating random forests for probability estimation - Dankowski - 2016 - Statistics in Medicine - Wiley Online Library
Probabilities can be consistently estimated using random forests. It is, however, unclear how random forests should be updated to make predictions for other centers or at different time points. The first method has been proposed by Elkan and may be used for updating any machine learning approach yielding consistent probabilities, so-called probability machines. The second approach is a new strategy specifically developed for random forests. Using the terminal nodes, which represent conditional probabilities, the random forest is first translated to logistic regression models.
How to Configure the Gradient Boosting Algorithm - Machine Learning Mastery
We can see a few interesting things in this table. In a similar talk by Owen at ODSC Boston 2015 titled "Open Source Tools and Data Science Competitions", he again summarized common parameters he uses: We can see some minor differences that may be relevant. Finally, Abhishek Thakur, in his post titled "Approaching (Almost) Any Machine Learning Problem" provided a similar table listing out key XGBoost parameters and suggestions for tuning. The spreads do cover the general defaults suggested above and more. It is interesting to note that Abhishek does provide some suggestions for tuning the alpha and beta model penalization terms as well as row sampling. You can develop and evaluate XGBoost models in just a few lines of Python code.
Fast and Scalable Machine Learning in R and Python with H2O
The focus of this talk is scalable machine learning using the H2O R and Python packages. H2O is an open source distributed machine learning platform designed for big data, with the added benefit that it's easy to use on a laptop (in addition to a multi-node Hadoop or Spark cluster). The core machine learning algorithms of H2O are implemented in high-performance Java; however, fully featured APIs are available in R, Python, Scala, REST/JSON and also through a web interface. Since H2O's algorithm implementations are distributed, this allows the software to scale to very large datasets that may not fit into RAM on a single machine. H2O currently features distributed implementations of generalized linear models, gradient boosting machines, random forest, deep neural nets, dimensionality reduction methods (PCA, GLRM), clustering algorithms (K-means), and anomaly detection methods, among others.
Gradient Boosting explained by Alex Rogozhnikov
Gradient boosting (GB) is a machine learning algorithm developed in the late '90s that is still very popular. It produces state-of-the-art results for many commercial (and academic) applications. This page explains how the gradient boosting algorithm works using several interactive visualizations. We take a 2-dimensional regression problem and investigate how a tree is able to reconstruct the function \( y = f(\mathbf{x}) = f(x_1, x_2) \). Play with the tree depth, then look at the tree-building process from above!
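For readers without the interactive page, the core loop of gradient boosting with squared loss can be sketched in plain Python. This is a simplified illustration, not the code behind the visualizations: it uses 1-D inputs and one-split stumps instead of 2-D inputs and deeper trees, and the names `fit_stump`, `gradient_boost` and `mse` are assumptions. Each stage fits a stump to the current residuals (the negative gradient of the squared loss) and adds a shrunken copy of it to the ensemble.

```python
from statistics import mean

def fit_stump(xs, targets):
    """Fit a one-split regression stump on 1-D data; returns a predict function."""
    pairs = sorted(zip(xs, targets))
    best = None
    for i in range(1, len(pairs)):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # no threshold can separate equal x values
        left = [t for _, t in pairs[:i]]
        right = [t for _, t in pairs[i:]]
        lm, rm = mean(left), mean(right)
        sse = sum((t - lm) ** 2 for t in left) + sum((t - rm) ** 2 for t in right)
        if best is None or sse < best[0]:
            threshold = (pairs[i - 1][0] + pairs[i][0]) / 2
            best = (sse, threshold, lm, rm)
    if best is None:  # every x is identical: fall back to a constant
        constant = mean(targets)
        return lambda x: constant
    _, threshold, lm, rm = best
    return lambda x: lm if x < threshold else rm

def gradient_boost(xs, ys, n_stages=50, learning_rate=0.1):
    """Boost stumps on squared loss; returns a predict function."""
    base = mean(ys)  # stage 0: the best constant prediction
    stumps = []
    preds = [base] * len(xs)
    for _ in range(n_stages):
        # For the loss 0.5 * (y - p)^2 the negative gradient is the residual y - p.
        residuals = [y - p for y, p in zip(ys, preds)]
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        preds = [p + learning_rate * stump(x) for p, x in zip(preds, xs)]
    return lambda x: base + learning_rate * sum(s(x) for s in stumps)

def mse(model, xs, ys):
    """Mean squared error of a model on the given data."""
    return mean((y - model(x)) ** 2 for x, y in zip(xs, ys))
```

On training data the error cannot increase from one stage to the next in this setup, since each stump is a least-squares fit to the residuals and the learning rate is below 2; the shrinkage factor trades more stages for smaller, more cautious steps.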