Ensemble Learning



Comparing Different Classification Machine Learning Models for an imbalanced dataset

#artificialintelligence

A data set is called imbalanced if it contains many more samples from one class than from the others: at least one class (the minority class) is represented by only a small number of training examples, while the remaining classes make up the majority. In this scenario, classifiers can achieve good accuracy on the majority class but very poor accuracy on the minority class(es) due to the influence of the larger majority class. A common example is credit card fraud detection, where fraudulent transactions (class 1) are usually far scarcer than legitimate ones (class 0). There are many reasons why a dataset might be imbalanced: the category being targeted might be very rare in the population, or the data might simply be difficult to collect. Let's tackle the problem by working on one such dataset.
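One simple remedy (sketched below in plain Python; the article itself may use a different technique) is random oversampling: duplicate minority-class examples until every class matches the majority. The fraud-style labels here are illustrative, not the article's data.

```python
import random
from collections import Counter

def oversample_minority(X, y, seed=0):
    """Randomly duplicate under-represented classes until all match the majority."""
    rng = random.Random(seed)
    counts = Counter(y)
    majority = max(counts.values())
    X_out, y_out = list(X), list(y)
    for label, n in counts.items():
        idx = [i for i, lab in enumerate(y) if lab == label]
        for _ in range(majority - n):
            i = rng.choice(idx)  # resample an existing minority example
            X_out.append(X[i])
            y_out.append(label)
    return X_out, y_out

# 95 legitimate transactions (class 0) vs. 5 frauds (class 1)
X = [[float(i)] for i in range(100)]
y = [0] * 95 + [1] * 5
X_bal, y_bal = oversample_minority(X, y)
print(Counter(y_bal))  # Counter({0: 95, 1: 95})
```

In practice you would oversample only the training split, never the evaluation data, so that reported metrics still reflect the true class distribution.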


Interpretable AI or How I Learned to Stop Worrying and Trust AI

#artificialintelligence

Let's now look at a concrete example. The problem is to predict math, reading and writing grades for high-school students in the U.S. We are given historical data that include features like gender, race/ethnicity (which is anonymized), parents' level of education, whether the student ate a standard/free/subsidized lunch, and the level of test preparation. Given this data, I trained a multi-class random forest model [source code]. One of the simplest techniques for explaining what the model has learned is to look at relative feature importance. Feature importance measures how big an impact a given feature has on predicting the outcome.
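A closely related, model-agnostic way to measure the same thing is permutation importance: shuffle one feature's values across rows and see how much accuracy drops. The sketch below uses a toy stand-in model, not the article's random forest or data.

```python
import random

def permutation_importance(predict, X, y, seed=0):
    """Drop in accuracy when a single feature's values are shuffled across rows."""
    rng = random.Random(seed)
    def accuracy(rows):
        return sum(predict(r) == lab for r, lab in zip(rows, y)) / len(y)
    base = accuracy(X)
    importances = []
    for j in range(len(X[0])):
        col = [row[j] for row in X]
        rng.shuffle(col)  # break the feature-label association for feature j
        shuffled = [row[:j] + [v] + row[j + 1:] for row, v in zip(X, col)]
        importances.append(base - accuracy(shuffled))
    return importances

# toy stand-in for a trained model: the label simply equals feature 0,
# while feature 1 is pure noise, so its importance should be ~0
predict = lambda row: row[0]
X = [[i % 2, (i * 7) % 3] for i in range(60)]
y = [row[0] for row in X]
imps = permutation_importance(predict, X, y)
print(imps)
```

Shuffling the informative feature costs roughly half the accuracy here, while shuffling the noise feature costs exactly nothing.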


Demystifying Maths of Gradient Boosting – Towards Data Science

#artificialintelligence

Boosting is an ensemble learning technique. Conceptually, such techniques involve two steps: 1. learning several base learners; 2. combining all of their outputs to reach a final prediction. Ensemble learning techniques come in different types, differing in how they implement the learning process for the base learners and how they combine their outputs into the final result. The main techniques used in ensemble learning are Bootstrap Aggregation (a.k.a. Bagging) and Boosting. In this article, we shall discuss Bagging briefly and then move on to Gradient Boosting, which is the focus of this article.
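The core idea of gradient boosting can be sketched in a few lines: for squared loss, the negative gradient is just the residual, so each round fits a small regressor (here a one-split "stump") to the current residuals and adds it, scaled by a learning rate. This is a minimal illustration, not the article's implementation.

```python
def fit_stump(X, resid):
    """Best single-split regressor on 1-D inputs, fit to the current residuals."""
    best = None
    for t in sorted(set(X))[:-1]:  # splitting at the max would leave a side empty
        left = [r for x, r in zip(X, resid) if x <= t]
        right = [r for x, r in zip(X, resid) if x > t]
        ml, mr = sum(left) / len(left), sum(right) / len(right)
        err = sum((r - ml) ** 2 for r in left) + sum((r - mr) ** 2 for r in right)
        if best is None or err < best[0]:
            best = (err, t, ml, mr)
    _, t, ml, mr = best
    return lambda x: ml if x <= t else mr

def gradient_boost(X, y, n_rounds=50, lr=0.1):
    """Each round fits a stump to the residuals (negative gradient of squared loss)."""
    f0 = sum(y) / len(y)  # initial prediction: the mean
    preds = [f0] * len(y)
    stumps = []
    for _ in range(n_rounds):
        resid = [yi - pi for yi, pi in zip(y, preds)]
        stump = fit_stump(X, resid)
        stumps.append(stump)
        preds = [pi + lr * stump(xi) for pi, xi in zip(preds, X)]
    return lambda x: f0 + lr * sum(s(x) for s in stumps)

# a step function the ensemble should learn
X = list(range(10))
y = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]
model = gradient_boost(X, y)
```

After 50 rounds the additive model has driven the residuals close to zero, so predictions approach 0 on the left half and 1 on the right half.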


Random Forest Algorithm in Machine Learning

#artificialintelligence

The random forest algorithm is one of the most popular and most powerful supervised machine learning algorithms, capable of performing both regression and classification tasks. As the name suggests, the algorithm creates a forest from a number of decision trees. Machine learning is a scientific discipline that explores the construction and study of algorithms that can learn from data. Such algorithms operate by building a model from example inputs and using it to make predictions or decisions, rather than following strictly static program instructions. Machine learning is closely related to and often overlaps with computational statistics, a discipline that also specializes in prediction-making.
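Two ingredients do most of the work in a random forest: training each tree on a bootstrap resample of the data, and aggregating the trees' predictions by majority vote. A minimal sketch with threshold "stumps" standing in for full decision trees (a real forest also samples a random subset of features at each split):

```python
import random
from collections import Counter

def fit_stump(X, y):
    """Best 1-D threshold classifier (either orientation) by training error."""
    best = None
    for t in set(X):
        for lo, hi in ((0, 1), (1, 0)):
            err = sum((lo if x <= t else hi) != lab for x, lab in zip(X, y))
            if best is None or err < best[0]:
                best = (err, t, lo, hi)
    _, t, lo, hi = best
    return lambda x: lo if x <= t else hi

def bagged_stumps(X, y, n_trees=15, seed=0):
    """Bootstrap resampling + majority vote, the forest's two key ingredients."""
    rng = random.Random(seed)
    trees = []
    for _ in range(n_trees):
        idx = [rng.randrange(len(X)) for _ in range(len(X))]  # with replacement
        trees.append(fit_stump([X[i] for i in idx], [y[i] for i in idx]))
    return lambda x: Counter(tree(x) for tree in trees).most_common(1)[0][0]

X = list(range(20))
y = [0] * 10 + [1] * 10  # class flips at x = 10
forest = bagged_stumps(X, y)
print(forest(0), forest(19))  # -> 0 1
```

Averaging many trees trained on slightly different resamples is what reduces the variance of any single tree.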


Gradient Boosting to Boost the Efficiency of Hydraulic Fracturing

arXiv.org Machine Learning

Submitted to the Journal of Petroleum Exploration and Production Technology. Abstract: In this paper we present a data-driven model for forecasting the production increase after hydraulic fracturing (HF). We use data from fracturing jobs performed at one of the Siberian oilfields. The data include features characterizing the jobs as well as geological information. To predict the oil rate after fracturing, a machine learning (ML) technique was applied. The ML-based prediction is compared to a prediction based on the experience of the reservoir and production engineers responsible for HF job planning.


Improve Machine Learning Results with Ensemble Learning

#artificialintelligence

NOTE: This article assumes a basic understanding of machine learning algorithms. Suppose you want to buy a new mobile phone. Would you walk directly into the first shop and purchase a phone based on the shopkeeper's advice? More likely, you would visit some online mobile seller sites where you can compare a variety of phones, their specifications, features, and prices. You might also consider the reviews that people have posted on the site, and you would probably ask your friends and colleagues for their opinions as well.
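That phone-buying analogy is exactly majority voting, the simplest ensemble combiner. A small sketch with hypothetical buy/don't-buy calls (1/0) from three sources, each wrong on a different phone:

```python
from collections import Counter

def majority_vote(votes):
    """The class predicted by most of the individual voters wins."""
    return Counter(votes).most_common(1)[0][0]

truth      = [1, 1, 0, 0, 1, 0]
shopkeeper = [0, 1, 0, 0, 1, 0]  # wrong on phone 0
reviews    = [1, 0, 0, 0, 1, 0]  # wrong on phone 1
friends    = [1, 1, 0, 1, 1, 0]  # wrong on phone 3
combined = [majority_vote(v) for v in zip(shopkeeper, reviews, friends)]
print(combined == truth)  # True
```

Each individual source is right on 5 of 6 phones, but because their mistakes fall on different phones, the majority vote is right on all 6. This is the intuition behind combining diverse learners.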


Predictive Modeling: Picking the best model – Towards Data Science

#artificialintelligence

Whether you are predicting data in an office setting or just competing in a Kaggle competition, it's important to test out different models to find the best fit for the data you are working with. I recently had the opportunity to compete with some very smart colleagues in a private Kaggle competition predicting faulty water pumps in Tanzania. After doing some data cleaning, I ran the following models, and I'll show you the results. First, we need to take a look at the data we're working with. In this particular data set, the features were in a separate file from the labels.
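The model-comparison loop itself is simple: score each candidate on held-out data and keep the winner. A pure-Python sketch with made-up stand-ins for the pump data (the names and the "faulty above 50" rule are invented for illustration):

```python
def accuracy(model, X, y):
    """Fraction of labels a model gets right."""
    return sum(model(x) == yi for x, yi in zip(X, y)) / len(y)

def pick_best(models, X_val, y_val):
    """Choose the model name with the highest held-out accuracy."""
    return max(models, key=lambda name: accuracy(models[name], X_val, y_val))

# toy stand-in: a pump is faulty (1) when a sensor reading exceeds 50
X_val = list(range(2, 100, 7))
y_val = [1 if x > 50 else 0 for x in X_val]

models = {
    "baseline": lambda x: 0,                    # always predict "functional"
    "threshold": lambda x: 1 if x > 50 else 0,  # a slightly smarter rule
}
print(pick_best(models, X_val, y_val))  # threshold
```

The key point is that the comparison runs on validation data the models never trained on; otherwise the most overfit model would always win.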


Optimal Minimal Margin Maximization with Boosting

arXiv.org Machine Learning

Boosting algorithms produce a classifier by iteratively combining base hypotheses. It has been observed experimentally that the generalization error keeps improving even after achieving zero training error. One popular explanation attributes this to improvements in margins. A common goal in a long line of research is to maximize the smallest margin using as few base hypotheses as possible, culminating with the AdaBoostV algorithm of Rätsch and Warmuth [JMLR'04]. The AdaBoostV algorithm was later conjectured to yield an optimal trade-off between the number of hypotheses trained and the minimal margin over all training points (Nie et al. [JMLR'13]). Our main contribution is a new algorithm refuting this conjecture. Furthermore, we prove a lower bound which implies that our new algorithm is optimal.
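To make "margin" concrete: the normalized margin of a point is its label times the weighted vote of the base hypotheses, divided by the total weight; the minimal margin is this quantity at the hardest training point. Below is a plain AdaBoost sketch (not AdaBoostV or the paper's new algorithm) with 1-D threshold stumps, purely to show the quantity being maximized:

```python
import math

def adaboost(X, y, n_rounds=10):
    """Plain AdaBoost with 1-D threshold stumps; labels must be -1/+1."""
    n = len(X)
    w = [1.0 / n] * n
    hyps = []  # (threshold, orientation, alpha)
    for _ in range(n_rounds):
        best = None
        for t in X:
            for s in (1, -1):
                err = sum(wi for wi, xi, yi in zip(w, X, y)
                          if (s if xi > t else -s) != yi)
                if best is None or err < best[0]:
                    best = (err, t, s)
        err, t, s = best
        alpha = 0.5 * math.log((1 - err) / max(err, 1e-12))
        hyps.append((t, s, alpha))
        # up-weight the points this stump got wrong
        w = [wi * math.exp(-alpha * yi * (s if xi > t else -s))
             for wi, xi, yi in zip(w, X, y)]
        total = sum(w)
        w = [wi / total for wi in w]
    return hyps

def margin(hyps, x, label):
    """Normalized margin: confidence of the weighted vote on (x, label)."""
    score = sum(a * (s if x > t else -s) for t, s, a in hyps)
    return label * score / sum(a for _, _, a in hyps)

X = [0, 1, 2, 3, 4, 5]
y = [-1, -1, -1, 1, 1, 1]
hyps = adaboost(X, y)
print(min(margin(hyps, xi, yi) for xi, yi in zip(X, y)))
```

On this trivially separable data a single stump already has zero error, so the minimal margin reaches its maximum of 1; the research question above concerns how many hypotheses are needed when no single one is perfect.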


A XGBoost risk model via feature selection and Bayesian hyper-parameter optimization

arXiv.org Machine Learning

This paper aims to explore models based on the extreme gradient boosting (XGBoost) approach for business risk classification. Feature selection (FS) algorithms and hyper-parameter optimization are considered simultaneously during model training. The five most commonly used FS methods, including weight by Gini, weight by Chi-square, hierarchical variable clustering, weight by correlation, and weight by information, are applied to alleviate the effect of redundant features. Two hyper-parameter optimization approaches, random search (RS) and the Bayesian tree-structured Parzen Estimator (TPE), are applied in XGBoost. The effect of different FS and hyper-parameter optimization methods on model performance is investigated via the Wilcoxon Signed Rank Test. The performance of XGBoost is compared to the traditionally used logistic regression (LR) model in terms of classification accuracy, area under the curve (AUC), recall, and F1 score obtained from 10-fold cross-validation. Results show that hierarchical clustering is the optimal FS method for LR, while weight by Chi-square achieves the best performance in XGBoost. Both TPE and RS optimization in XGBoost outperform LR significantly. TPE optimization shows a superiority over RS, since it results in significantly higher accuracy and marginally higher AUC, recall, and F1 score. Furthermore, XGBoost with TPE tuning shows lower variability than the RS method. Finally, the ranking of feature importance based on XGBoost enhances model interpretation. Therefore, XGBoost with Bayesian TPE hyper-parameter optimization serves as a practical and powerful approach for business risk modeling.
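TPE builds a probabilistic model of good vs. bad trials and needs a library such as hyperopt; the RS baseline it is compared against, however, fits in a few lines. A pure-Python sketch with a made-up cross-validation score surface (the parameter names mirror XGBoost's, but the objective here is invented for illustration):

```python
import random

def random_search(objective, space, n_trials=50, seed=0):
    """Sample hyper-parameters uniformly at random; keep the best-scoring trial."""
    rng = random.Random(seed)
    best_params, best_score = None, float("-inf")
    for _ in range(n_trials):
        params = {name: rng.uniform(lo, hi) for name, (lo, hi) in space.items()}
        score = objective(params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score

# a made-up CV score surface peaking at learning_rate = 0.1, max_depth = 6
# (a real max_depth is an integer; floats keep the sketch short)
space = {"learning_rate": (0.01, 0.3), "max_depth": (2, 10)}
def cv_score(p):
    return 1 - (p["learning_rate"] - 0.1) ** 2 - 0.01 * (p["max_depth"] - 6) ** 2

best_params, best_score = random_search(cv_score, space)
```

In the real setting, `objective` would run a 10-fold cross-validated XGBoost fit per trial; TPE improves on this loop by proposing new trials near previously good ones instead of sampling blindly.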