Introducing random forests in R
In this post, I will show how to use random forests for classification. A random forest is a prediction technique that grows a set of decision trees (hence, a forest), each fitted to a bootstrap sample of the observations and restricted to a random subset of the features at each split. Restricting the features this way yields trees that do not always split on the strongest predictors first, which makes the trees less correlated with one another. I will test this technique on the LoanDefaults dataset to predict which customers will default on a loan payment in a specific month. This dataset has two interesting characteristics: positive cases are far less frequent than negative ones, and the existing features require some preprocessing. I will be using the ranger (RANdom forest GEneRator) package to fit the forests, skimr to get a summary of the data, rpart and rpart.plot to build an alternative decision tree model, BAdatasets to access the dataset, tidymodels for its prediction workflow facilities and forcats for the variable importance plot.
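As a rough illustration of the technique, here is a minimal sketch that fits a random forest with ranger on synthetic data. It does not use the LoanDefaults dataset: the `toy` data frame, its column names (`balance`, `late_payments`, `age`, `default`) and the rule that generates defaults are all made up for the example, and the values chosen for `num.trees` and `mtry` are arbitrary rather than tuned.

```r
library(ranger)

set.seed(2022)

# Hypothetical stand-in for a loan dataset (NOT LoanDefaults):
# three invented predictors and a minority of positive cases.
n <- 1000
toy <- data.frame(
  balance       = rnorm(n),
  late_payments = rpois(n, lambda = 1),
  age           = sample(21:70, n, replace = TRUE)
)

# Made-up rule: default probability rises with balance and late payments.
p <- plogis(-3 + 0.8 * toy$balance + 0.6 * toy$late_payments)
toy$default <- factor(ifelse(runif(n) < p, "yes", "no"),
                      levels = c("no", "yes"))

# Each tree is grown on a bootstrap sample of the rows, and only
# `mtry` randomly chosen features are candidates at each split.
rf <- ranger(
  default ~ .,
  data       = toy,
  num.trees  = 500,
  mtry       = 2,
  importance = "impurity",
  seed       = 2022
)

table(toy$default)                               # class imbalance
rf$prediction.error                              # out-of-bag error rate
rf$confusion.matrix                              # out-of-bag confusion matrix
sort(rf$variable.importance, decreasing = TRUE)  # impurity-based importance
```

With few positive cases, the overall out-of-bag error can look low even when most defaults are missed, so the confusion matrix is usually the more informative output here.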
Jul-25-2022, 11:21:15 GMT