07 -- Hands On ML -- Ensemble


Ensemble learning combines the predictions of multiple models and outputs the class that receives the most votes. When you train multiple decision trees, each on a random sample of the dataset, and aggregate the predictions of all the trees, the predicted class is the one that gets the most votes; this approach is called a random forest. A voting classifier trains several different classifiers, such as logistic regression, an SVM, a random forest, and others, on the same data and takes the majority vote as the predicted class, i.e., hard voting. Voting can also be soft: average the predicted class probabilities across the classifiers and take the argmax of the average. A minimal example is sketched below.
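
Below is a minimal sketch of hard and soft voting with scikit-learn. The synthetic dataset and the specific hyperparameters are illustrative assumptions, not taken from the original post.

```python
# Minimal sketch: hard vs. soft voting with scikit-learn.
# The dataset and hyperparameters are illustrative assumptions.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = make_classification(n_samples=500, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Hard voting: each classifier casts one vote; the majority class wins.
hard_vote = VotingClassifier(
    estimators=[
        ("lr", LogisticRegression(max_iter=1000)),
        ("svm", SVC()),  # probabilities are not needed for hard voting
        ("rf", RandomForestClassifier(random_state=42)),
    ],
    voting="hard",
)

# Soft voting: average the predicted class probabilities, then take the argmax.
soft_vote = VotingClassifier(
    estimators=[
        ("lr", LogisticRegression(max_iter=1000)),
        ("svm", SVC(probability=True)),  # soft voting needs predict_proba
        ("rf", RandomForestClassifier(random_state=42)),
    ],
    voting="soft",
)

for name, clf in [("hard", hard_vote), ("soft", soft_vote)]:
    clf.fit(X_train, y_train)
    print(name, clf.score(X_test, y_test))
```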

VipBoost: A More Accurate Boosting Algorithm

AAAI Conferences

Boosting is a well-known method for improving the accuracy of many learning algorithms. In this paper, we propose a novel boosting algorithm, VipBoost (voting on boosting classifications from imputed learning sets), which first generates multiple incomplete datasets from the original dataset by randomly removing a small percentage of observed attribute values, then uses an imputer to fill in the missing values. It then applies AdaBoost (using some base learner) to each of the imputed learning sets, producing multiple classifiers. The subsequent prediction on a new test case is the most frequent classification from these classifiers. Our empirical results show that VipBoost produces very effective classifiers that significantly improve accuracy for unstable base learners and some stable learners, especially when the initial dataset is incomplete.
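
As a rough, hedged sketch of that procedure (not the authors' implementation), one could mask a small fraction of attribute values in several copies of the data, impute them, train AdaBoost on each imputed copy, and predict by majority vote. The removal rate, the number of copies, and mean imputation below are assumptions chosen for illustration.

```python
# Hedged sketch of a VipBoost-style procedure: mask values, impute, run
# AdaBoost per imputed set, and take the most frequent prediction.
# Assumes non-negative integer class labels; settings are illustrative.
import numpy as np
from sklearn.ensemble import AdaBoostClassifier
from sklearn.impute import SimpleImputer

def vipboost_fit(X, y, n_sets=5, remove_frac=0.05, random_state=0):
    rng = np.random.default_rng(random_state)
    classifiers = []
    for _ in range(n_sets):
        X_missing = X.astype(float).copy()
        # Randomly remove a small percentage of observed attribute values.
        mask = rng.random(X_missing.shape) < remove_frac
        X_missing[mask] = np.nan
        # Fill the artificial gaps with an imputer (mean imputation here).
        X_imputed = SimpleImputer(strategy="mean").fit_transform(X_missing)
        # Train AdaBoost (with its default base learner) on the imputed set.
        clf = AdaBoostClassifier(random_state=random_state).fit(X_imputed, y)
        classifiers.append(clf)
    return classifiers

def vipboost_predict(classifiers, X):
    # The final prediction is the most frequent class across the classifiers.
    votes = np.stack([clf.predict(X) for clf in classifiers])
    return np.apply_along_axis(
        lambda col: np.bincount(col).argmax(), axis=0, arr=votes
    )
```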

Vote-boosting ensembles

Machine Learning

Vote-boosting is a sequential ensemble learning method in which the individual classifiers are built on different weighted versions of the training data. To build a new classifier, the weight of each training instance is determined in terms of the degree of disagreement among the current ensemble predictions for that instance. For low class-label noise levels, especially when simple base learners are used, emphasis should be made on instances for which the disagreement rate is high. When more flexible classifiers are used and as the noise level increases, the emphasis on these uncertain instances should be reduced. In fact, at sufficiently high levels of class-label noise, the focus should be on instances on which the ensemble classifiers agree. The optimal type of emphasis can be automatically determined using cross-validation. An extensive empirical analysis using the beta distribution as emphasis function illustrates that vote-boosting is an effective method to generate ensembles that are both accurate and robust.
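
A hedged sketch of this idea (not the authors' exact algorithm) is to reweight the training instances after each round with a beta pdf applied to the per-instance disagreement rate of the current ensemble. The base learner, the beta parameters (a, b), and the normalization below are assumptions made for illustration.

```python
# Hedged sketch of vote-boosting-style emphasis: instance weights come from a
# beta pdf evaluated at the ensemble's disagreement rate on that instance.
# Assumes non-negative integer class labels; all settings are illustrative.
import numpy as np
from scipy.stats import beta
from sklearn.tree import DecisionTreeClassifier

def vote_boost(X, y, n_rounds=10, a=2.0, b=2.0, random_state=0):
    n = len(y)
    weights = np.full(n, 1.0 / n)
    ensemble = []
    for _ in range(n_rounds):
        clf = DecisionTreeClassifier(max_depth=3, random_state=random_state)
        clf.fit(X, y, sample_weight=weights)
        ensemble.append(clf)
        # Disagreement rate: fraction of ensemble members that do not vote
        # for the per-instance majority class.
        preds = np.stack([m.predict(X) for m in ensemble])  # (rounds, n)
        majority_counts = np.array(
            [np.bincount(preds[:, i]).max() for i in range(n)]
        )
        disagreement = 1.0 - majority_counts / len(ensemble)
        # Emphasis via a beta pdf; a and b control whether uncertain or
        # agreed-upon instances receive more weight.
        weights = beta.pdf(np.clip(disagreement, 1e-3, 1 - 1e-3), a, b)
        weights /= weights.sum()
    return ensemble
```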

Ensemble Methods: Elegant Techniques to Produce Improved Machine Learning Results


Ensemble methods are techniques that create multiple models and then combine them to produce improved results. Ensemble methods usually produce more accurate solutions than a single model would. This has been the case in a number of machine learning competitions, where the winning solutions used ensemble methods. In the popular Netflix Prize competition, the winner used an ensemble method to implement a powerful collaborative filtering algorithm. Another example is KDD Cup 2009, where the winner also used ensemble methods.

Want to Win at Kaggle? Pay Attention to Your Ensembles.


Summary: Want to win a Kaggle competition, or at least get a respectable place on the leaderboard? These days it's all about ensembles, and for a lot of practitioners that means reaching for random forests. Random forests have indeed been very successful, but it's worth remembering that there are three different categories of ensembles and some important hyperparameter tuning issues within each. Here's a brief review. The Kaggle competitions are like formula racing for data science. Winners edge out competitors at the fourth decimal place, and like Formula 1 race cars, not many of us would mistake them for daily drivers.