Ensemble Learning
Random Forests for Big Data
Genuer, Robin, Poggi, Jean-Michel, Tuleau-Malot, Christine, Villa-Vialaneix, Nathalie
Big Data is one of the major challenges of statistical science and has numerous consequences from algorithmic and theoretical viewpoints. Big Data always involve massive data but they also often include online data and data heterogeneity. Recently some statistical methods have been adapted to process Big Data, like linear regression models, clustering methods and bootstrapping schemes. Based on decision trees combined with aggregation and bootstrap ideas, random forests were introduced by Breiman in 2001. They are a powerful nonparametric statistical method allowing to consider in a single and versatile framework regression problems, as well as two-class and multi-class classification problems. Focusing on classification problems, this paper proposes a selective review of available proposals that deal with scaling random forests to Big Data problems. These proposals rely on parallel environments or on online adaptations of random forests. We also describe how related quantities -- such as out-of-bag error and variable importance -- are addressed in these methods. Then, we formulate various remarks for random forests in the Big Data context. Finally, we experiment five variants on two massive datasets (15 and 120 millions of observations), a simulated one as well as real world data. One variant relies on subsampling while three others are related to parallel implementations of random forests and involve either various adaptations of bootstrap to Big Data or to "divide-and-conquer" approaches. The fifth variant relates on online learning of random forests. These numerical experiments lead to highlight the relative performance of the different variants, as well as some of their limitations.
Winning Tips on Machine Learning Competitions by Kazanova, Current Kaggle #3
No matter how many books you read, tutorials you finish or problems you solve, there will always be a data set you might come across where you get clueless. Specially, when you are in your early days of Machine Learning. In this blog post, you'll learn some essential tips on building machine learning models which most people learn with experience. These tips were shared by Marios Michailidis (a.k.a Kazanova), Kaggle Grandmaster, Current Rank #3 in a webinar happened on 5th March 2016. The key to succeeding in competitions is perseverance. Marios said, 'I won my first competition (Acquired valued shoppers challenge) and entered kaggle's top 20 after a year of continued participation on 4 GB RAM laptop (i3)'. Were you planning to give up? While reading Q & As, if you have any questions, please feel free to drop them in comments!
A Simple XGBoost Tutorial Using the Iris Dataset
I had the opportunity to start using xgboost machine learning algorithm, it is fast and shows good results. Here I will be using multiclass prediction with the iris dataset from scikit-learn. In order to work with the data, I need to install various scientific libraries for python. The best way I have found is to use Anaconda. It simply installs all the libs and helps to install new ones.
Tuning hyperparams fast with Hyperband - FastML
Hyperband is a relatively new method for tuning iterative algorithms. It performs random sampling and attempts to gain the edge by using time spent optimizing in the best way. We explain a few things that were not clear to us right away, and try the algorithm in practice. Candidates for tuning with Hyperband include all the SGD derivatives - meaning the whole deep learning - and tree ensembles: gradient boosting, and perhaps to a lesser extent, random forest and extremely randomized trees. To quantify this idea, we compare to random run at twice the speed which beats the two Bayesian Optimization methods, i.e., running random search for twice as long yields superior results.
Indoor Localization by Fusing a Group of Fingerprints Based on Random Forests
Guo, Xiansheng, Ansari, Nirwan, Li, Huiyong
Indoor localization based on SIngle Of Fingerprint (SIOF) is rather susceptible to the changing environment, multipath, and non-line-of-sight (NLOS) propagation. Building SIOF is also a very time-consuming process. Recently, we first proposed a GrOup Of Fingerprints (GOOF) to improve the localization accuracy and reduce the burden of building fingerprints. However, the main drawback is the timeliness. In this paper, we propose a novel localization framework by Fusing A Group Of fingerprinTs (FAGOT) based on random forests. In the offline phase, we first build a GOOF from different transformations of the received signals of multiple antennas. Then, we design multiple GOOF strong classifiers based on Random Forests (GOOF-RF) by training each fingerprint in the GOOF. In the online phase, we input the corresponding transformations of the real measurements into these strong classifiers to obtain multiple independent decisions. Finally, we propose a Sliding Window aIded Mode-based (SWIM) fusion algorithm to balance the localization accuracy and time. Our proposed approaches can work better in an unknown indoor scenario. The burden of building fingerprints can also be reduced drastically. We demonstrate the performance of our algorithms through simulations and real experimental data using two Universal Software Radio Peripheral (USRP) platforms.
Day05: Recognition with random forest
Day05 brings another Kaggle competition, "Digit Recognizer". The goal in this competition is to take an image of a handwritten single digit, and determine what that digit is. The Jupyter Notebook for this little project is found here. Today, I think I'll use a random forest classifier. A random forest classifier is like an election where the outcome with the most votes (from the ensemble of decision trees) is the predicted classification.
Tuning hyperparams fast with Hyperband
Hyperband is a method for tuning iterative algorithms. It uses random sampling and attempts to gain the edge by using time spent optimizing in the best way. We explain a few things that were not clear to us right away, and try the algorithm in practice. Come back later for la version final. For us, the name conjures an idea of some fancy topological construct.
rxNeuralNet vs. xgBoost vs. H2O
Recently, I did a session at local user group in Ljubljana, Slovenija, where I introduced the new algorithms that are available with MicrosoftML package for Microsoft R Server 9.0.3. For dataset, I have used two from (still currently) running sessions from Kaggle. In the last part, I did image detection and prediction of MNIST dataset and compared the performance and accuracy between. MNIST Handwritten digit database is available here. Starting off with rxNeuralNet, we have to build a NET# model or Neural network to work it's way.
Extreme Gradient Boosting and Behavioral Biometrics
Manning, Benjamin (University of Georgia)
As insider hacks become more prevalent it is becoming more useful to identify valid users from the inside of a system rather than from the usual external entry points where exploits are used to gain entry. One of the main goals of this study was to ascertain how well Gradient Boosting could be used for prediction or, in this case, classification or identification of a specific user through the learning of HCI-based behavioral biometrics. If applicable, this procedure could be used to verify users after they have gained entry into a protected system using data that is as human-centric as other biometrics, but less invasive. For this study an Extreme Gradient Boosting algorithm was used for training and testing on a dataset containing keystroke dynamics information. This specific algorithm was chosen because the majority of current research utilizes mainstream methods such as KNN and SVM and the hypothesis of this study was centered on the potential applicability of ensemble related decision or model trees. The final predictive model produced an accuracy of 0.941 with a Kappa value of 0.942 demonstrating that HCI-based behavioral biometrics in the form of keystroke dynamics can be used to identify the users of a system.
Webinar: Improve Your Regression with CART and Gradient Boosting
In this webinar we'll introduce you to a powerful tree-based machine learning algorithm called gradient boosting. Gradient boosting often outperforms linear regression, Random Forests, and CART. Boosted trees automatically handle variable selection, variable interactions, nonlinear relationships, outliers, and missing values. We'll see that CART decision trees are the foundation of gradient boosting and discuss some of the advantages of boosting versus a Random Forest. We will explore the gradient boosting algorithm and discuss the most important modeling parameters like the learning rate, number of terminal nodes, number of trees, loss functions, and more.