 Ensemble Learning


Generative and reproducible benchmarks for comprehensive evaluation of machine learning classifiers

arXiv.org Artificial Intelligence

Understanding the strengths and weaknesses of machine learning (ML) algorithms is crucial for determining their scope of application. Here, we introduce the DIverse and GENerative ML Benchmark (DIGEN), a collection of synthetic datasets for comprehensive, reproducible, and interpretable benchmarking of machine learning algorithms for classification of binary outcomes. The DIGEN resource consists of 40 mathematical functions which map continuous features to discrete endpoints for creating synthetic datasets. These 40 functions were discovered using a heuristic algorithm designed to maximize the diversity of performance among multiple popular machine learning algorithms, thus providing a useful test suite for evaluating and comparing new methods. Access to the generative functions facilitates understanding of why a method performs poorly compared to other algorithms, thus providing ideas for improvement. The resource, with extensive documentation and analyses, is open-source and available on GitHub.


When Getting It Right Gets It Wrong

#artificialintelligence

In a previous post I briefly touched on the problem of overfitting, loosely defined as a machine learning model memorizing its training data set: the model predicts that data with high accuracy but performs poorly when presented with new data, a symptom of high variance. The post discussed the Random Forest approach, which uses bootstrap aggregation to address this issue, but it raised the question: "Why does intentionally producing lower-quality data sets and averaging across their results produce better predictions?" Reality, it turns out, is messy, so intentionally introducing inaccuracy in the process of producing predictions (that's some impressive alliteration, don't you think?) usually makes them better. It's a process known as regularization. It turns out that all kinds of machine learning algorithms have overfitting risks, and the way you regularize depends on the model you're trying to fit.
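One way to build intuition for why averaging helps is a small simulation. The sketch below is not from the post: it assumes each model's prediction is the true value plus independent Gaussian noise, and the names `noisy_prediction` and `averaged_prediction` are hypothetical. Bagged trees are correlated in practice, so the real reduction in variance is smaller than this idealized case suggests.

```python
import random
import statistics


def noisy_prediction(rng, true_value=10.0, noise_sd=2.0):
    """One high-variance model: the truth plus independent Gaussian noise (assumption)."""
    return rng.gauss(true_value, noise_sd)


def averaged_prediction(rng, n_models, true_value=10.0, noise_sd=2.0):
    """Average the predictions of n_models independent noisy models."""
    predictions = [noisy_prediction(rng, true_value, noise_sd) for _ in range(n_models)]
    return statistics.fmean(predictions)


rng = random.Random(0)  # fixed seed so the run is repeatable
trials = 2000

single = [noisy_prediction(rng) for _ in range(trials)]
averaged = [averaged_prediction(rng, n_models=25) for _ in range(trials)]

single_var = statistics.pvariance(single)
averaged_var = statistics.pvariance(averaged)

print(f"variance of a single model:     {single_var:.3f}")
print(f"variance of a 25-model average: {averaged_var:.3f}")
```

Under the independence assumption, the variance of the average shrinks roughly in proportion to the number of models, which is the effect bootstrap aggregation tries to approximate.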


Using AntiPatterns to avoid MLOps Mistakes

#artificialintelligence

Different values of hyper-parameters often prove to be significant drivers of model performance; they are expensive to tune and mostly task specific. Hyper-parameters play such a crucial role in modeling architectures that entire research efforts are devoted to developing efficient hyper-parameter search strategies (Bergstra et al., 2013; Nguyen et al., 2019; Henderson et al., 2018; Van Rijn and Hutter, 2018; Probst et al., 2019). The set of hyper-parameters differs for different learning algorithms. For instance, even a simple classification model like the decision tree classifier has hyper-parameters such as the maximum depth of the tree, the minimum number of samples required to split an internal node, and the criterion used to evaluate a split, either the impurity at a node (gini) or the information gain (entropy). Ensemble models like random forest classifiers and gradient boosting machines also have additional parameters governing the number of estimators (trees) to include in the model.
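To illustrate why tuning gets expensive, here is a minimal sketch that enumerates a small search grid. The parameter names follow scikit-learn's conventions (`max_depth`, `min_samples_split`, `criterion`, `n_estimators`), but the candidate values and the helper `expand_grid` are assumptions made for this example, and no model is actually trained.

```python
from itertools import product

# Hypothetical candidate values for a decision tree classifier.
tree_grid = {
    "max_depth": [3, 5, None],
    "min_samples_split": [2, 10, 50],
    "criterion": ["gini", "entropy"],
}


def expand_grid(grid):
    """Return every combination of hyper-parameter values as a list of dicts."""
    names = list(grid)
    value_lists = [grid[name] for name in names]
    return [dict(zip(names, values)) for values in product(*value_lists)]


tree_configs = expand_grid(tree_grid)

# An ensemble adds at least one more axis: the number of trees.
forest_grid = {**tree_grid, "n_estimators": [100, 300, 1000]}
forest_configs = expand_grid(forest_grid)

print(len(tree_configs))    # 3 * 3 * 2 = 18 configurations
print(len(forest_configs))  # 18 * 3 = 54 configurations
```

Each added hyper-parameter multiplies the number of configurations, and every configuration normally requires a full training and validation run, which is why more efficient search strategies are an active research topic.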


Leveraging Machine Learning to Detect Fraud: Tips to Developing a Winning Kaggle Solution

#artificialintelligence

A value count on the target label shows that only 3.5% of the transactions are labeled fraudulent. Typically, fraudulent transactions make up a small percentage of transactions. Correlation can help you understand the linear relationship between features and between features and the target. A correlation can range between -1 (perfect negative relationship) and 1 (perfect positive relationship), with 0 indicating no straight-line relationship. Visualizing the data helps with feature selection by revealing trends in the data.
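As a rough illustration of the two checks described above, the sketch below counts label values and computes a Pearson correlation with the standard library only. The data is made up for the example (it is not the competition data), and `pearson` is a hypothetical helper name.

```python
import math
from collections import Counter


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Value count on a synthetic, imbalanced target: 35 fraud labels out of 1000.
labels = [1] * 35 + [0] * 965
counts = Counter(labels)
fraud_share = counts[1] / len(labels)
print(f"fraud share: {fraud_share:.1%}")

# Perfect positive and perfect negative straight-line relationships.
positive = pearson([1, 2, 3, 4], [2, 4, 6, 8])
negative = pearson([1, 2, 3, 4], [8, 6, 4, 2])

# y = x**2 on a symmetric range: strongly related, yet no straight-line trend.
curved = pearson([-2, -1, 0, 1, 2], [4, 1, 0, 1, 4])
print(positive, negative, curved)
```

The last case is a reminder that a correlation near 0 rules out only a linear relationship, so a feature with low correlation to the target can still be useful to a tree-based model.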


Boost-R: Gradient Boosted Trees for Recurrence Data

arXiv.org Machine Learning

Recurrence data arise from multi-disciplinary domains spanning reliability, cyber security, healthcare, online retailing, etc. This paper investigates an additive-tree-based approach, known as Boost-R (Boosting for Recurrence Data), for recurrent event data with both static and dynamic features. Boost-R constructs an ensemble of gradient boosted additive trees to estimate the cumulative intensity function of the recurrent event process, where a new tree is added to the ensemble by minimizing the regularized L2 distance between the observed and predicted cumulative intensity. Unlike conventional regression trees, a time-dependent function is constructed by Boost-R on each tree leaf. The sum of these functions, from multiple trees, yields the ensemble estimator of the cumulative intensity. The divide-and-conquer nature of tree-based methods is appealing when hidden sub-populations exist within a heterogeneous population. The non-parametric nature of regression trees helps to avoid parametric assumptions on the complex interactions between event processes and features. Critical insights and advantages of Boost-R are investigated through comprehensive numerical examples. Datasets and computer code of Boost-R are made available on GitHub. To the best of our knowledge, Boost-R is the first gradient boosted additive-tree-based approach for modeling large-scale recurrent event data with both static and dynamic feature information.


Prediction of the final rank of Players in PUBG with the optimal number of features

arXiv.org Artificial Intelligence

PUBG is an online video game that has become very popular among young people in recent years. Final rank, which indicates the performance of a player, is one of the most important features of this game. This paper focuses on predicting the final rank of the players based on their skills and abilities. We used different machine learning algorithms to predict the final rank of the players on a dataset obtained from Kaggle, which has 29 features. Using the correlation heatmap, we varied the number of features used for the model. Of these models, GBR and LGBM gave the best results, with accuracies of 91.63% and 91.26% respectively for 14 features, and 90.54% and 90.01% for 8 features. Although the accuracy of the models with 14 features is slightly better than with 8 features, the empirical run time with 8 features is 1.4x lower than with 14 features for LGBM and 1.5x lower for GBR. Furthermore, reducing the number of features any further significantly hampers the performance of all the ML models. Therefore, we conclude that 8 is the optimal number of features for predicting the final rank of a player in PUBG with high accuracy and low run time.


A machine learning, bias-free approach for predicting business success using Crunchbase data

#artificialintelligence

Promising results were obtained with the gradient boosting classifier. Predicting the success of a business venture has always been a struggle for both practitioners and researchers. However, thanks to companies that aggregate data about other firms, it has become possible to create and validate predictive models based on an unprecedented amount of real-world examples. In this study, we use data obtained from one of the largest platforms integrating business information: Crunchbase. Our final training set consisted of 213 171 companies.


Giuliano Liguori on Twitter

#artificialintelligence

“🔝 #MachineLearning Prediction Algorithms {#infographic} by @DatumGuy Random regression Logistic regression Decision Tree Random forest Gradient Boosting @antgrasso @Ronald_vanLoon @KirkDBorne @SpirosMargaris @mvollmer1 @machinelearnflx @AISOMA_AG @andy_fitze @SwissCognitive”


A Comprehensive Guide to Ensemble Learning - What Exactly Do You Need to Know - neptune.ai

#artificialintelligence

Ensemble learning techniques have been shown to yield better performance on many machine learning problems. We can use these techniques for regression as well as classification problems. The final prediction from these ensembling techniques is obtained by combining results from several base models. Averaging, voting and stacking are some of the ways the results are combined to obtain a final prediction. In this article, we will explore how ensemble learning can be used to come up with optimal machine learning models. Ensemble learning combines several machine learning models to solve a single problem.
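Two of the combination rules mentioned above, averaging and voting, are simple enough to sketch in a few lines. The function names and the sample predictions below are assumptions for illustration; stacking is left out because it requires training an additional model on the base models' outputs.

```python
from collections import Counter
from statistics import fmean


def average_predictions(predictions):
    """Averaging: combine numeric predictions from several regressors."""
    return fmean(predictions)


def majority_vote(labels):
    """Voting: return the class label predicted by the most base models.

    Ties go to the label that appears first in the input.
    """
    return Counter(labels).most_common(1)[0][0]


# Hypothetical outputs of three base models for a single input.
regression_outputs = [2.0, 3.0, 4.0]
classification_outputs = ["fraud", "ok", "fraud"]

combined_value = average_predictions(regression_outputs)
combined_label = majority_vote(classification_outputs)

print(combined_value)  # 3.0
print(combined_label)  # fraud
```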


Random Forest

#artificialintelligence

Random forest is a supervised machine learning algorithm that is widely used in classification and regression problems. It builds decision trees on different samples and takes their majority vote for classification and their average for regression. One of the most important features of the random forest algorithm is that it can handle data sets containing continuous variables, as in the case of regression, and categorical variables, as in the case of classification. It tends to give better results for classification problems. Let's dive into a real-life analogy to understand this concept further.
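The "different samples" are typically bootstrap samples: rows drawn from the training set with replacement. The sketch below shows only that sampling step, under the assumption that the training set is a plain list of rows; `bootstrap_sample` is a hypothetical helper, and no trees are built here.

```python
import random


def bootstrap_sample(rows, rng):
    """Draw len(rows) rows with replacement, so some rows repeat and some are left out."""
    return [rng.choice(rows) for _ in rows]


rng = random.Random(42)  # fixed seed so the run is repeatable
rows = list(range(1000))  # stand-in for 1000 training rows

# One bootstrap sample per tree; five trees here for brevity.
samples = [bootstrap_sample(rows, rng) for _ in range(5)]

# Share of distinct original rows that ended up in each sample.
unique_fractions = [len(set(sample)) / len(rows) for sample in samples]
for i, fraction in enumerate(unique_fractions):
    print(f"tree {i}: {fraction:.1%} of the original rows appear at least once")
```

Each sample contains roughly 63% of the distinct rows on average, so every tree sees a slightly different view of the data, and the rows left out of a sample can serve as a held-out check for that tree.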