This blog covers another interesting machine learning algorithm, the decision tree, and its mathematical implementation. At every point in our lives, we make decisions in order to proceed. Similarly, this algorithm makes decisions on the dataset provided, figuring out the best split at each step so as to improve accuracy. This, in turn, helps in giving valuable results. A decision tree is a machine learning algorithm that represents a hierarchical division of a dataset, forming a tree based on certain parameters.

Decision trees are simple to implement and equally easy to interpret. They are also ideal for machine learning newcomers! If you are unsure about even one of these questions, you've come to the right place! The decision tree is a powerful machine learning algorithm that also serves as the building block for other widely used and more complicated algorithms such as Random Forest, XGBoost, and LightGBM. You can imagine why it's important to learn about this topic!

Here we will build a Python(-ic/-esque) Random Forest. Because Python makes it so easy to build very complex machines out of one or two libraries, it is better to delve into the basics before poking our noses into untameable beasts. Let us start from a single "decision tree" (a simple problem). After that we will extend our knowledge, learn to build a Random Forest, and apply it to a real problem. To warm up, we will start with a toy problem with only two features and two classes.
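As a warm-up for that toy problem, here is a minimal sketch of the single step a decision tree repeats at every node: trying each feature and threshold and keeping the split with the lowest weighted Gini impurity. The six data points, the 0/1 labels, and the helper names `gini` and `best_split` are made up for illustration; they are not taken from the original post.

```python
def gini(labels):
    """Gini impurity of a list of 0/1 labels: 2 * p * (1 - p)."""
    n = len(labels)
    if n == 0:
        return 0.0
    p = sum(labels) / n  # fraction of class 1
    return 2 * p * (1 - p)


def best_split(X, y):
    """Return (weighted_gini, feature_index, threshold) of the best split."""
    best = None
    n = len(y)
    for j in range(len(X[0])):
        # Candidate thresholds: midpoints between consecutive distinct values.
        values = sorted(set(row[j] for row in X))
        for lo, hi in zip(values, values[1:]):
            t = (lo + hi) / 2
            left = [label for row, label in zip(X, y) if row[j] <= t]
            right = [label for row, label in zip(X, y) if row[j] > t]
            score = (len(left) * gini(left) + len(right) * gini(right)) / n
            if best is None or score < best[0]:
                best = (score, j, t)
    return best


# Hypothetical toy data: two features, two classes.
X = [[1.0, 5.0], [2.0, 1.0], [3.0, 4.0], [6.0, 2.0], [7.0, 5.0], [8.0, 1.0]]
y = [0, 0, 0, 1, 1, 1]

score, feature, threshold = best_split(X, y)
print(score, feature, threshold)  # 0.0 0 4.5
```

In this made-up data the first feature separates the classes perfectly at 4.5, so the weighted impurity drops to zero. A full tree would apply the same search recursively to each child node, and a Random Forest would repeat that on bootstrap samples with random feature subsets.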

After the surprising results of the 2016 presidential election, I wanted to better understand the socio-economic and cultural factors that played a role in voting behavior. With the election results in the books, I thought it would be fun to reverse-engineer a predictive model of voting behavior based on some of the widely available county-level data sets. For example, if you want to answer the question "how could the election have been different if the percentage of people with at least a bachelor's degree had been 2% higher nationwide?" you can simply toggle that parameter up to 1.02 and click "Submit" to find out. The predictions are driven by a random forest classification model that has been tuned and trained on 71 distinct county-level attributes. Using real data, the model has a predictive accuracy of 94.6% and an ROC AUC score of 96%.

ABSTRACT The default variable-importance measure in random forests, Gini importance, has been shown to suffer from the bias of the underlying Gini-gain splitting criterion. While the alternative permutation importance is generally accepted as a reliable measure of variable importance, it is also computationally demanding and suffers from other shortcomings. We propose a simple solution to the misleading/untrustworthy Gini importance which can be viewed as an overfitting problem: we compute the loss reduction on the out-of-bag instead of the in-bag training samples. Variable importance is not very well defined as a concept. Even for the case of a linear model with n observations, p variables and the standard n >> p situation, there is no theoretically defined variable importance metric in the sense of a parametric quantity that a variable importance estimator should try to estimate (Grömping, 2009).
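To make the permutation importance mentioned in this abstract concrete, here is a minimal sketch: shuffle one feature column at a time and record how much the accuracy drops. This is the generic permutation technique, not the out-of-bag Gini importance the paper proposes. The `predict` function is a hypothetical stand-in for a fitted model, and the data is made up.

```python
import random


def accuracy(predict, X, y):
    """Fraction of rows whose prediction matches the label."""
    return sum(predict(row) == label for row, label in zip(X, y)) / len(y)


def permutation_importance(predict, X, y, n_repeats=10, seed=0):
    """Mean drop in accuracy when each feature column is shuffled."""
    rng = random.Random(seed)
    baseline = accuracy(predict, X, y)
    importances = []
    for j in range(len(X[0])):
        drops = []
        for _ in range(n_repeats):
            column = [row[j] for row in X]
            rng.shuffle(column)
            # Rebuild the rows with only column j permuted.
            X_perm = [row[:j] + [v] + row[j + 1:] for row, v in zip(X, column)]
            drops.append(baseline - accuracy(predict, X_perm, y))
        importances.append(sum(drops) / n_repeats)
    return importances


# Hypothetical stand-in for a fitted model: it looks at feature 0 only.
def predict(row):
    return int(row[0] > 0.5)


# Made-up data: feature 1 is irrelevant to the labels by construction.
X = [[i / 10, (i * 7 % 10) / 10] for i in range(10)]
y = [predict(row) for row in X]

imp = permutation_importance(predict, X, y)
print(imp)  # feature 0 gets a positive score, feature 1 gets 0.0
```

In practice the accuracy would be measured on held-out or out-of-bag samples rather than on the training data, and the repeated re-scoring for every feature is what makes the method computationally demanding.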

Gini impurity is named after the Italian statistician Corrado Gini. It can be understood as a splitting criterion that aims to minimize the probability of misclassification. To understand the definition (as shown in the figure) and exactly how we can build up a decision tree, let's get started with a very simple dataset in which, depending on various weather conditions, we decide whether or not to play an outdoor game. From the definition, a dataset containing only one class has a Gini impurity of 0. In building up the decision tree, our idea is to choose the feature with the lowest Gini impurity as the root node, and so on down the tree. In this simple dataset, the decision on whether to play tennis or not depends on 4 features (Outlook, Temperature, Humidity, Wind).
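The root-node choice described above can be sketched in a few lines, assuming the usual definition of Gini impurity as one minus the sum of squared class proportions. The six rows below are hypothetical stand-ins (the original table is not reproduced here) and use only two of the four features. The helper names are illustrative.

```python
from collections import Counter


def gini_impurity(labels):
    """Gini impurity 1 - sum_k p_k^2 of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    counts = Counter(labels)
    return 1.0 - sum((c / n) ** 2 for c in counts.values())


def weighted_gini(rows, feature, target):
    """Size-weighted Gini impurity of the groups made by splitting on `feature`."""
    groups = {}
    for row in rows:
        groups.setdefault(row[feature], []).append(row[target])
    n = len(rows)
    return sum(len(g) / n * gini_impurity(g) for g in groups.values())


def best_feature(rows, features, target):
    """Feature whose split gives the lowest weighted Gini impurity."""
    return min(features, key=lambda f: weighted_gini(rows, f, target))


# Hypothetical rows for illustration only, not the article's dataset.
rows = [
    {"Outlook": "Sunny", "Wind": "Weak", "Play": "No"},
    {"Outlook": "Sunny", "Wind": "Strong", "Play": "No"},
    {"Outlook": "Overcast", "Wind": "Weak", "Play": "Yes"},
    {"Outlook": "Overcast", "Wind": "Strong", "Play": "Yes"},
    {"Outlook": "Rain", "Wind": "Weak", "Play": "Yes"},
    {"Outlook": "Rain", "Wind": "Strong", "Play": "No"},
]

for f in ("Outlook", "Wind"):
    print(f, round(weighted_gini(rows, f, "Play"), 3))  # Outlook 0.167, Wind 0.444
print(best_feature(rows, ["Outlook", "Wind"], "Play"))  # Outlook
```

With these made-up rows, Outlook yields two pure groups and one mixed group, so it has the lower weighted impurity and would become the root. The same comparison is then repeated inside each branch.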

Decision trees are great and are useful for a variety of tasks. They form the backbone of many of the best-performing models in the industry, such as XGBoost and LightGBM. But how exactly do they work? In fact, this is one of the most frequently asked questions in ML/DS interviews. We generally know that they work in a stepwise manner and have a tree structure in which each node is split on some feature according to some criterion.

Gatto, Joseph, Lanka, Ravi, Iwashita, Yumi, Stoica, Adrian

Have you ever wondered how your feature space is impacting the prediction of a specific sample in your dataset? In this paper, we introduce Single Sample Feature Importance (SSFI), which is an interpretable feature importance algorithm that allows for the identification of the most important features that contribute to the prediction of a single sample. When a dataset can be learned by a Random Forest classifier or regressor, SSFI shows how the Random Forest's prediction path can be utilized for low-level feature importance calculation. SSFI results in a relative ranking of features, highlighting those with the greatest impact on a data point's prediction. We demonstrate these results both numerically and visually on four different datasets.

Gini impurity is a measure of how often a randomly chosen element from the set would be incorrectly labeled if it were randomly labeled according to the distribution of labels in the subset. In simple terms, Gini impurity is the measure of impurity in a node. To understand the formula a little better, let us talk specifically about the binary case, where nodes contain only two classes. In the five examples of candidate nodes below, labelled A-E and shown with their distribution of positive and negative classes, which is the ideal condition to be in? I reckon you would say A or E, and you are right.
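For the binary case, the definition above reduces to a one-variable formula. The symbol p (the fraction of positive examples in the node) is notation introduced here for illustration, and the reading of A and E as the two pure extremes is an assumption, since the node figure is not reproduced.

```latex
\begin{align}
G &= \sum_{k} p_k (1 - p_k) = 1 - \sum_{k} p_k^2
  && \text{(probability of mislabeling a random element)} \\
G(p) &= 1 - p^2 - (1 - p)^2 = 2p(1 - p)
  && \text{(two classes, } p_1 = p,\ p_2 = 1 - p\text{)} \\
G(0) &= G(1) = 0, \qquad \max_{0 \le p \le 1} G(p) = G\!\left(\tfrac{1}{2}\right) = \tfrac{1}{2}
\end{align}
```

So a node that is entirely one class has zero impurity, which is presumably why the nodes at the two ends of the range are the ideal ones, while an evenly mixed node is the worst case with impurity 0.5.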

We investigate how asymmetrizing an impurity function affects the choice of optimal node splits when growing a decision tree for binary classification. In particular, we relax the usual axioms of an impurity function and show how skewing an impurity function biases the optimal splits to isolate points of a particular class when splitting a node. We give a rigorous definition of this notion, then give a necessary and sufficient condition for such a bias to hold. We also show that the technique of class weighting is equivalent to applying a specific transformation to the impurity function, and tie all these notions together for a class of impurity functions that includes the entropy and Gini impurity. We also briefly discuss cost-insensitive impurity functions and give a characterization of such functions.