Advantages:
- Non-Linearity : The model can capture non-linear relationships between the features and the target, because it does not assume any fixed functional form. It also handles continuous data, i.e. data in which any two values may have infinitely many values between them.
- Robustness to Outliers : The decision tree regression model deals with extreme values much more effectively than many other regression models. This is due to its splitting criterion: splits are based on threshold comparisons rather than on distances or squared errors, so a single extreme value cannot distort the entire fit.
- Handling Missing Values : Decision Tree methods can deal with missing values by considering alternate branches based on available data.
Disadvantages:
- Overfitting : The model may fit the training data too closely, especially when the tree is grown deep or the training set is small. Such a tree captures noise rather than the underlying pattern, which leads to inaccurate predictions: the model will not generalize well to new values outside the training dataset.
- Instability : Even a small change in the dataset can cause a large change in the decision tree that is formed, because one altered split near the root changes every split beneath it. This makes decision trees highly sensitive to changes in the dataset.
- Lack of Continuity : Since decision trees divide the data into disjoint regions and provide a constant prediction for each of these regions, the prediction function is a step function. This makes the model less well suited to smooth, continuous target variables, as the short sketch after this list illustrates.
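The following is a minimal sketch of the overfitting and lack-of-continuity points, assuming scikit-learn is available; the synthetic dataset and the depth values are made up purely for illustration and are not part of the original discussion.

```python
# Sketch: an unrestricted tree memorizes noise (overfitting), while limiting
# the depth gives coarser, piecewise-constant predictions. Dataset is synthetic.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.RandomState(0)
X = np.sort(rng.uniform(0, 5, 80)).reshape(-1, 1)   # one continuous feature
y = np.sin(X).ravel() + rng.normal(0, 0.1, 80)      # noisy non-linear target

deep_tree = DecisionTreeRegressor(random_state=0).fit(X, y)          # fits the noise
shallow_tree = DecisionTreeRegressor(max_depth=3, random_state=0).fit(X, y)  # smoother fit

X_test = np.linspace(0, 5, 10).reshape(-1, 1)
print(deep_tree.predict(X_test))     # step-like outputs that track the training noise
print(shallow_tree.predict(X_test))  # fewer distinct constant values per region
```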
The choice of the root node is made using the concepts of entropy and information gain.
Entropy is a measure of impurity in the data. For example, at a node in the decision tree, the more homogeneous the data is (i.e. the more it belongs to the same class or category), the lower the entropy of that node. We grow the tree towards the leaf nodes by considering the entropy of the nodes; a node whose data is maximally homogeneous is treated as a leaf node. The entropy of a dataset S is calculated as: E(S) = -Σ pᵢ · log₂(pᵢ)
Here E(S) is the entropy and pᵢ is the probability of randomly choosing a data point belonging to class i.
Example: if we have a dataset of 10 observations, 6 of which are YES and 4 of which are NO, the entropy is calculated as follows: pᵢ for YES = 6/10, i.e. the probability of randomly choosing a YES from the dataset, and similarly pᵢ for NO = 4/10, so E(S) = -(0.6 · log₂(0.6) + 0.4 · log₂(0.4)) ≈ 0.971. If all the observations had belonged to one class, say YES, the entropy would have been 0; if the observations had been half YESes and half NOs, the entropy would have been 1.
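The small helper below reproduces these entropy numbers; the function name and the use of math.log2 are my own choices for illustration, not from the original post.

```python
from math import log2

def entropy(class_counts):
    """Entropy E(S) = -sum(p_i * log2(p_i)) over the classes in a node."""
    total = sum(class_counts)
    # Skip empty classes, since 0 * log2(0) is taken to be 0.
    return sum(-c / total * log2(c / total) for c in class_counts if c > 0)

print(entropy([6, 4]))   # ~0.971  (6 YES / 4 NO, the example above)
print(entropy([10, 0]))  # 0.0     (perfectly homogeneous node)
print(entropy([5, 5]))   # 1.0     (half YES, half NO)
```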
Information gain is the deciding factor for what is chosen as the root node. It measures the reduction in impurity at a node. The feature that has the highest information gain is used as the root node. The subgroups created by that split are then divided further, and the splitting feature at each step is again chosen by calculating the information gain within each subgroup.
The formula for information gain is given as follows:
IG = EntropyOfDataset - WeightedAverageEntropyOfSubgroups = E(S) - Σᵥ (|Sᵥ| / |S|) · E(Sᵥ)
For example, suppose that while creating a decision tree we choose weather as the root node, and this feature has three categories: sunny, hot and windy. Each of these subgroups is then subdivided, again based on the information gain calculated within that subgroup. What the weighted average means here is: for every subgroup, take the entropy of the subgroup multiplied by (size of the subgroup / size of the parent), and add these terms together. A small numeric sketch of this calculation is given below.
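This sketch computes information gain for one candidate split using the formula above. The YES/NO counts for the sunny, hot and windy subgroups are invented for illustration; they are not from the original example.

```python
from math import log2

def entropy(class_counts):
    total = sum(class_counts)
    return sum(-c / total * log2(c / total) for c in class_counts if c > 0)

def information_gain(parent_counts, subgroup_counts):
    """IG = E(parent) - sum(|subgroup| / |parent| * E(subgroup))."""
    parent_size = sum(parent_counts)
    weighted = sum(sum(sub) / parent_size * entropy(sub) for sub in subgroup_counts)
    return entropy(parent_counts) - weighted

# Hypothetical node with 6 YES / 4 NO, split by "weather" into sunny / hot / windy.
parent = [6, 4]
subgroups = [[3, 1],   # sunny: 3 YES, 1 NO
             [2, 1],   # hot:   2 YES, 1 NO
             [1, 2]]   # windy: 1 YES, 2 NO
print(information_gain(parent, subgroups))  # ~0.095 for these made-up counts
# The feature with the highest information gain becomes the splitting node.
```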
For a detailed explanation, refer to this video: https://youtu.be/CWzpomtLqqs?feature=shared