Today I would like to discuss one of the most popular classification and regression algorithms (or, more precisely, a family of algorithms): the Decision Tree.
What is a Decision Tree?
As the name suggests, it is a tree-structured model that classifies an input or predicts the value of an outcome from an input dataset.
Key concepts: root node, leaf node, priority node, information gain, and entropy.
As the diagram above shows, a decision tree is a tree-structured model made up of various nodes.
The starting node is called the root node, and the terminal nodes are called leaf nodes.
The question now is how to choose the root node from the list of predictor columns.
This is where the concepts of entropy and information gain come into play.
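Entropy measures how impure (mixed) the values of the dependent column (1 or 0, yes or no) are within a group of rows: it is zero when every row has the same label and highest when the labels are evenly split. Information gain is calculated from entropy: it is the drop in entropy obtained by splitting the rows on the values of a predictor column.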
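To make the two quantities concrete, here are the standard textbook definitions. The symbols are my own notation and do not come from the diagram: S is the set of rows, p_c is the fraction of rows in S with label c, A is a predictor column, and S_v is the subset of rows where A takes the value v.

```latex
H(S) = -\sum_{c} p_c \log_2 p_c
\qquad
IG(S, A) = H(S) - \sum_{v \in \mathrm{values}(A)} \frac{|S_v|}{|S|} \, H(S_v)
```

For a yes/no dependent column, H(S) is 0 when all rows share one label and 1 when the labels are split half and half.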
The predictor column with the highest information gain is chosen as the root node.
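Here is a minimal sketch of that selection step in plain Python. It assumes categorical predictor columns and uses a tiny made-up dataset; the column names (outlook, windy, play) and the helper names (entropy, information_gain, choose_root) are purely illustrative and not taken from any library.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())


def information_gain(rows, column, target):
    """Entropy of the target minus the weighted entropy after splitting on `column`."""
    parent = entropy([row[target] for row in rows])
    # Group the target labels by each distinct value of the predictor column.
    groups = {}
    for row in rows:
        groups.setdefault(row[column], []).append(row[target])
    weighted = sum(len(g) / len(rows) * entropy(g) for g in groups.values())
    return parent - weighted


def choose_root(rows, predictors, target):
    """Return the predictor with the highest information gain, plus all gains."""
    gains = {col: information_gain(rows, col, target) for col in predictors}
    return max(gains, key=gains.get), gains


# Hypothetical toy data: "play" is the dependent column.
rows = [
    {"outlook": "sunny", "windy": "no", "play": "no"},
    {"outlook": "sunny", "windy": "yes", "play": "no"},
    {"outlook": "rainy", "windy": "no", "play": "yes"},
    {"outlook": "rainy", "windy": "yes", "play": "yes"},
]

root, gains = choose_root(rows, ["outlook", "windy"], "play")
print(root, gains)
```

In this toy data, outlook separates the yes and no rows perfectly while windy tells us nothing, so outlook is picked as the root. Real libraries may use other impurity measures (Gini impurity, for example) and also handle numeric columns, which this sketch does not.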
The next concept is pruning.
To understand pruning, you first need to understand bias and variance, and overfitting and underfitting, which I shall discuss in my next post.
Till then, stay tuned.