One of the advantages of using a Decision tree is that it efficiently identifies the most significant variable and splits the population on it. In previous article, we developed a high-level understanding of Decision trees. In this article, we will focus on the science behind splitting the nodes and choosing the most significant split.
Decision trees can use various algorithms to split a node into two or more sub-nodes. The creation of sub-nodes increases the homogeneity of resultant sub-nodes. In other words, we can say that purity of the node increases with respect to the target variable. The decision tree splits the nodes on all available variables and then selects the split which results in the most homogeneous sub-nodes.
The algorithm selection is also based on the type of target variables. Let’s look at the four most commonly used algorithms in the decision tree:
Gini index says, if we select two items from a population at random then they must be of the same class and the probability for this is 1 if the population is pure.
It works with the categorical target variable “Success” or “Failure”.It performs only Binary splitsHigher the value of Gini higher the homogeneity.CART (Classification and Regression Tree) uses the Gini method to create binary splits.
Steps to Calculate Gini for a split
Calculate Gini for sub-nodes, using the formula sum of the square of probability for success and failure (p^2+q^2). Calculate Gini for split using weighted Gini score of each node of that split
Example: – Referring to the example used in the previous article, where we want to segregate the students based on the target variable of playing cricket or not. In the below snapshot, we split the population using two input variables Gender and Class. Now I want to identify which split is producing more homogeneous sub-nodes using the Gini index.
Split on Gender:-
Calculate, Gini for sub-node Female = (0.2)*(0.2)+(0.8)*(0.8)=0.68Gini for sub-node Male = (0.65)*(0.65)+(0.35)*(0.35)=0.55Calculate weighted Gini for Split Gender = (10/30)*0.68+(20/30)*0.55 = 0.59
Similar for Split on Class:-
Gini for sub-node Class IX = (0.43)*(0.43)+(0.57)*(0.57)=0.51Gini for sub-node Class X = (0.56)*(0.56)+(0.44)*(0.44)=0.51Calculate weighted Gini for Split Class = (14/30)*0.51+(16/30)*0.51 = 0.51
Above, you can see that the Gini score for Split on Gender is higher than Class so the node will split on Gender.
It is an algorithm to find out the statistical significance between the differences between sub-nodes and parent node. We measure it by the sum of squares of standardized differences between observed and expected frequencies of the target variable.
It works with the categorical target variable “Success” or “Failure”.It can perform two or more splitsHigher the value of Chi-Square higher the statistical significance of differences between sub-node and Parent node. Chi-Square of each node is calculated using the formula, Chi-square = ((Actual – Expected)^2 / Expected)^1/2It generates a tree called CHAID (Chi-square Automatic Interaction Detector)
Steps to Calculate Chi-square for a split:
Calculate Chi-square for an individual node by calculating the deviation for Success and Failure both calculated Chi-square of Split using Sum of all Chi-square of success and Failure of each node of the split
Example: Let’s work with the above example that we have used to calculate Gini.
Split on Gender:
First, we are populating for node Female, Populate the actual value for “Play Cricket” and “Not Play Cricket”, here these are 2 and 8 respectively.
Calculate expected value for “Play Cricket” and “Not Play Cricket”, here it would be 5 for both because the parent node has a probability of 50% and we have applied the same probability on Female count(10).
Calculate deviations by using the formula, Actual – Expected. It is for “Play Cricket” (2 – 5 = -3) and for “Not play cricket” ( 8 – 5 = 3).
Calculate Chi-square of nodes for “Play Cricket” and “Not Play Cricket” using a formula with formula, = ((Actual – Expected)^2 / Expected)^1/2. You can refer to below table for calculation.
Follow similar steps for calculating the Chi-square value for the Male node. Now add all Chi-square values to calculate Chi-square for split Gender.
Split on Class:
Perform similar steps of calculation for a split on Class and you will come up with the below table.
Above, you can see that Chi-square also identifies that the Gender split is more significant compare to Class.
Let’s look at the image below and think about which node can be described easily. I am sure, your answer is C because it requires less information as all values are similar where as B requires more information to describe it and A would require even more. You can say in other words also that C is a Pure node, B is less Impure and A is more impure.
Now, we can build a conclusion that a less impure node requires less information to describe it and a more impure node requires more information. Information theory has a measure to define this degree of disorganization in a system, which is called Entropy. If the sample is completely homogeneous, then the entropy is zero, and if the sample is equally divided it has an entropy of one.
Entropy can be calculated using the formula:-
Here p and q are the probability of success and failure respectively in that node. Entropy is also used with a categorical target variable. It chooses the split which has the lowest entropy compared to the parent node and other splits.
Steps to calculate entropy for a split:
Calculate entropy of parent nodeCalculate entropy of each individual node of split and calculate the weighted average of all sub-nodes available in the split.
Example: Let’s use this method to identify the best split for student example.
Entropy for parent node = -(15/30) log2 (15/30) – (15/30) log2 (15/30) = 1. Here 1 shows that it is a impure node.Entropy for Female node = -(2/10) log2 (2/10) – (8/10) log2 (8/10) = 0.72 and for male node, -(13/20) log2 (13/20) – (7/20) log2 (7/20) = 0.93.Entropy for split Gender = Weighted entropy of sub-nodes = (10/30)*0.72 + (20/30)*0.93 = 0.86Entropy for Class IX node, -(6/14) log2 (6/14) – (8/14) log2 (8/14) = 0.99 and for Class X node, -(9/16) log2 (9/16) – (7/16) log2 (7/16) = 0.99.Entropy for split Class = (14/30)*0.99 + (16/30)*0.99 = 0.99
Above you can see that entropy of split on Gender is lower compare to Class so we will again go with split Gender. We can derive information gain from entropy as 1- Entropy.
Reduction in Variance:
Till now, we have discussed the algorithms for the categorical target variable. Reduction in Variance is an algorithm for the continuous target variable. This algorithm uses the same formula of variance to choose the right split that we went through the descriptive statistics. The split with lower variance is selected as the criteria to split the population:
Above X-bar is the mean of the values, X is actual and n is the number of values.
Steps to calculate Variance:
Calculate variance for each node. Calculate Variance for each split as the weighted average of each node variance
Example:- Let’s assign numerical value 1 for play cricket and 0 for not playing cricket. Now follow the steps to identify the right split:
Variance for Root node, here mean value is (15*1 + 15*0)/30 = 0.5 and we have 15 one and 15 zero. Now variance would be ((1-0.5)^2+(1-0.5)^2+….15 times+(0-0.5)^2+(0-0.5)^2+…15 times) / 30, this can be written as (15*(1-0.5)^2+15*(0-0.5)^2) / 30 = 0.25Mean of Female node = (2*1+8*0)/10=0.2 and Variance = (2*(1-0.2)^2+8*(0-0.2)^2) / 10 = 0.16Mean of Male Node = (13*1+7*0)/20=0.65 and Variance = (13*(1-0.65)^2+7*(0-0.65)^2) / 20 = 0.23Variance for Split Gender = Weighted Variance of Sub-nodes = (10/30)*0.16 + (20/30) *0.23 = 0.21Mean of Class IX node = (6*1+8*0)/14=0.43 and Variance = (6*(1-0.43)^2+8*(0-0.43)^2) / 14= 0.24Mean of Class X node = (9*1+7*0)/16=0.56 and Variance = (9*(1-0.56)^2+7*(0-0.56)^2) / 16 = 0.25Variance for Split Gender = (14/30)*0.24 + (16/30) *0.25 = 0.25
Above, you can see that the Gender split has lower variance compared to the parent node so the split would be on Gender only.
Above, we have looked at various algorithms to split a node into sub-nodes. Now to create a decision tree, sub-nodes are further split into two or more sub-nodes and all input variables are considered for creating the split again. Fields already involved in the split also get considered for the split. It is a recursive process and it stops if the node ends up as a pure node or it reaches the maximum depth of the tree or the number of records in the node reaches the preset limit.
In an extreme scenario, a decision tree can have a number of nodes equal to the total number of observations, but that would be a very complex tree. If we are expanding the decision tree towards more complexity based on the training data set, then it causes overfitting and losses the predictive power of the model because it is not generalized. Overfitting can be removed by pruning the nodes.