By Clare Liu, a data scientist in the fintech industry, based in Hong Kong. A decision tree is one of the most popular and powerful machine learning algorithms I have learned. It is a non-parametric supervised learning method that can be used for both classification and regression tasks. The goal is to create a model that predicts the value of a target variable by learning simple decision rules inferred from the data features. For a classification model, the target values are discrete, whereas for a regression model they are continuous.
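The classification/regression distinction above can be seen in a minimal sketch (assuming scikit-learn is available; the toy data is purely illustrative): the same tree-growing idea serves both tasks, with a classifier predicting discrete labels and a regressor predicting continuous values.

```python
from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor

X = [[0.0], [1.0], [2.0], [3.0]]

# Classification: the target values are discrete class labels.
clf = DecisionTreeClassifier(max_depth=2).fit(X, [0, 0, 1, 1])
print(clf.predict([[0.5]]))  # a discrete class label

# Regression: the target values are continuous.
reg = DecisionTreeRegressor(max_depth=2).fit(X, [0.1, 0.4, 1.9, 3.2])
print(reg.predict([[0.5]]))  # a continuous value
```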
Decision trees are useful for a wide variety of tasks and form the backbone of some of the best-performing models in industry, such as XGBoost and LightGBM. But how exactly do they work? This is, in fact, one of the most frequently asked questions in ML/DS interviews. We generally know that they work in a stepwise manner and have a tree structure where we split a node using some feature on some criterion.
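That stepwise splitting can be sketched for a single feature: try every candidate threshold and keep the one that most reduces impurity. This is a hypothetical, from-scratch illustration, not any library's implementation; the names `gini` and `best_split` are my own.

```python
def gini(labels):
    """Gini impurity of a list of binary labels (0/1)."""
    n = len(labels)
    if n == 0:
        return 0.0
    p = sum(labels) / n  # fraction of the positive class
    return 1.0 - p**2 - (1 - p) ** 2

def best_split(feature, labels):
    """Return (threshold, weighted impurity) of the best split on one feature."""
    best = (None, float("inf"))
    for t in sorted(set(feature)):
        left = [y for x, y in zip(feature, labels) if x <= t]
        right = [y for x, y in zip(feature, labels) if x > t]
        if not left or not right:
            continue  # a split must send samples both ways
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
        if score < best[1]:
            best = (t, score)
    return best

print(best_split([1, 2, 3, 4], [0, 0, 1, 1]))  # → (2, 0.0)
```

Splitting at 2 separates the classes perfectly, so the weighted impurity drops to zero; a real tree repeats this search over all features at every node.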
Gini impurity measures how often a randomly chosen element from a set would be incorrectly labeled if it were labeled randomly according to the distribution of labels in that set. In simple terms, Gini impurity is a measure of the impurity in a node. To understand the formula a little better, let us focus on the binary case, where each node contains only two classes. Of the five candidate nodes below, labelled A-E, with their distributions of the positive and negative class shown, which is the ideal condition to be in? I reckon you would say A or E, and you are right.
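In the binary case the formula reduces to G = 1 - p² - (1 - p)², where p is the fraction of positives in the node. The counts below are illustrative stand-ins for candidate nodes A-E (the original figure is not reproduced here): the pure nodes score 0, the 50/50 node scores the maximum of 0.5, which is why the pure candidates are ideal.

```python
def gini(pos, neg):
    """Binary Gini impurity from positive/negative counts."""
    p = pos / (pos + neg)
    return 1 - p**2 - (1 - p) ** 2

# Assumed class counts for candidate nodes A-E, for illustration only.
nodes = {"A": (0, 10), "B": (3, 7), "C": (5, 5), "D": (7, 3), "E": (10, 0)}
for name, (pos, neg) in nodes.items():
    print(name, round(gini(pos, neg), 2))
# A and E print 0.0 (pure); C prints 0.5 (maximally impure)
```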
This post attempts to consolidate information on tree algorithms and their implementations in Scikit-learn and Spark. In particular, it was written to clarify how feature importance is calculated. There are many great resources online discussing how decision trees and random forests are built, and this post is not intended to be one of them. Although it includes short definitions for context, it assumes the reader has a grasp of these concepts and wishes to know how the algorithms are implemented in Scikit-learn and Spark. Decision trees learn how to best split the dataset into smaller and smaller subsets to predict the target value.
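For orientation, here is a short sketch (assuming scikit-learn) of reading the impurity-based feature importances this post examines; the Iris dataset is used only as a convenient example.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

data = load_iris()
model = RandomForestClassifier(n_estimators=50, random_state=0)
model.fit(data.data, data.target)

# feature_importances_ gives one value per feature; they sum to 1.0.
for name, imp in zip(data.feature_names, model.feature_importances_):
    print(f"{name}: {imp:.3f}")
```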
What do you think would be the simplest way to predict the probabilities? I have touched the code up a little bit. The fit method accepts a dataframe (data) and a string naming the target attribute (target). Both are then assigned to the object, and the names of the independent attributes are derived and assigned to the object as well.
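The fit method described above can be reconstructed roughly as follows; this is a hypothetical sketch, and the class name `SimpleTree` and attribute names are my own assumptions for illustration.

```python
import pandas as pd

class SimpleTree:
    def fit(self, data: pd.DataFrame, target: str):
        # Store the dataframe and the target attribute name on the object.
        self.data = data
        self.target = target
        # Derive the independent attribute names: every column but the target.
        self.independent = [col for col in data.columns if col != target]
        return self

df = pd.DataFrame({"age": [25, 32], "income": [40, 60], "label": [0, 1]})
model = SimpleTree().fit(df, "label")
print(model.independent)  # → ['age', 'income']
```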