Let us try to explain today’s topic without delving into statistics.
Suppose we are analysing the heights of men, using a dataset of m records.
Now, if we are Indian, we will probably collect data from Indian men first, because that is the easiest data to collect. Am I correct?
After collecting a good amount of data, we will probably start collecting some data from other countries, but not much, as reaching foreign nationals is more difficult.
Then we start doing all kinds of analysis on the collected data. Now, tell me, what would we get? Don’t you think the results would lean towards Indian men? This is what is known as biased data, and the bias, in this case, is towards Indians.
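To make the idea concrete, here is a small sketch in Python. The two groups, their average heights and the sample sizes are all made-up numbers chosen only for illustration, not real statistics. It simply shows how the average of a lopsided sample sits close to the over-represented group.

```python
import random
import statistics

def sample_heights(n_local, n_foreign, seed=0):
    """Draw heights (cm) from two hypothetical groups of men."""
    rng = random.Random(seed)
    # Assumed parameters for illustration only, not real statistics.
    local = [rng.gauss(165.0, 6.0) for _ in range(n_local)]
    foreign = [rng.gauss(175.0, 6.0) for _ in range(n_foreign)]
    return local + foreign

# Mostly "easy to collect" local data, with a handful of foreign records.
biased = sample_heights(n_local=950, n_foreign=50)
# Both groups equally represented.
balanced = sample_heights(n_local=500, n_foreign=500)

biased_mean = statistics.mean(biased)
balanced_mean = statistics.mean(balanced)
print(f"biased sample mean:   {biased_mean:.1f} cm")
print(f"balanced sample mean: {balanced_mean:.1f} cm")
```

Under these assumptions, the biased sample's average stays near the local group's average, while the balanced sample lands roughly midway between the two groups.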
Now, if the data is heavily biased, it will not contain much variety. So we can conclude that the above-mentioned dataset has high bias and low variance, and it is not good for our analysis.
Next, suppose we have mixed data from various regions and include a number of parameters in the dataset to meet the requirements of each country.
Now we have very little bias but high variance. What do you think? Is it good for analysis?
Actually, when the variance is very high, the model becomes very complex, with high computational cost, and its predictions can change noticeably each time it is trained on a different sample of the data.
So, one must have a good balance of bias and variance in the data set.
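Here is a rough sketch of that trade-off, using only the Python standard library. Everything in it is an assumption made for illustration: the "true" relationship y = sin(x), the noise level, and the two toy models (one that always predicts the average, and one that copies its nearest training point). We train both models on many fresh samples and look at how far off they are on average (bias) and how much their predictions jump around (variance).

```python
import math
import random
import statistics

def make_data(rng, n=30):
    """Noisy samples of a hypothetical true relationship y = sin(x)."""
    xs = [rng.uniform(0.0, 6.0) for _ in range(n)]
    ys = [math.sin(x) + rng.gauss(0.0, 0.3) for x in xs]
    return xs, ys

def predict_mean(xs, ys, x0):
    """Very simple model: ignores x0 and always predicts the average y."""
    return statistics.mean(ys)

def predict_nearest(xs, ys, x0):
    """Very flexible model: copies the y of the nearest training point."""
    i = min(range(len(xs)), key=lambda k: abs(xs[k] - x0))
    return ys[i]

rng = random.Random(42)
x0 = 1.5                 # the point where we compare predictions
true_y = math.sin(x0)
preds = {"simple": [], "flexible": []}
for _ in range(200):     # 200 independent training samples
    xs, ys = make_data(rng)
    preds["simple"].append(predict_mean(xs, ys, x0))
    preds["flexible"].append(predict_nearest(xs, ys, x0))

stats = {}
for name, values in preds.items():
    bias = statistics.mean(values) - true_y   # average error
    spread = statistics.pstdev(values)        # variation between runs
    stats[name] = (bias, spread)
    print(f"{name:8s} bias={bias:+.2f}  spread={spread:.2f}")
```

In this setup the simple model is far off on average but barely changes between runs, while the flexible model is close on average but gives a different answer almost every time.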
I hope that the above picture is clear enough to explain the relationship between bias and variance.
Now, a model with high bias and low variance is said to underfit. A model with low bias but high variance is said to overfit. Why?
Check the diagram below.
One can clearly see that the left-hand model is less complex than the right-hand model. I hope you can now understand the concepts of bias and variance, and of overfitting and underfitting models.
Let’s continue with our topic, decision trees.
After building a decision tree model with a large number of predictor variables, it will look like the one below (the right-hand model):
But if one chooses very few predictor variables, the model will underfit.
If we strike the right balance between bias and variance, we get a good model (the middle one).
An overfitted model will run fine and give good accuracy only on the training data; if we test the same model on another set of records, the accuracy will go down due to high variance.
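If you want to see this for yourself, below is a minimal sketch of a tiny decision tree written from scratch in plain Python. Please treat it as an illustration under assumptions, not a reference implementation: the data is synthetic (two features, with 20% of the labels deliberately flipped), the helper names are my own, and I vary the maximum tree depth as a stand-in for model complexity instead of the number of predictor variables.

```python
import random

def gini(labels):
    """Gini impurity of a list of 0/1 labels."""
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    return 2.0 * p * (1.0 - p)

def build_tree(rows, depth, max_depth):
    """rows is a list of (features, label); returns a dict node or a leaf label."""
    labels = [y for _, y in rows]
    majority = int(sum(labels) * 2 >= len(labels))
    if depth >= max_depth or len(set(labels)) == 1:
        return majority
    best = None
    for f in range(len(rows[0][0])):
        for t in sorted({x[f] for x, _ in rows}):
            left = [r for r in rows if r[0][f] < t]
            right = [r for r in rows if r[0][f] >= t]
            if not left or not right:
                continue
            # Weighted impurity of the two children; lower is better.
            score = (len(left) * gini([y for _, y in left])
                     + len(right) * gini([y for _, y in right]))
            if best is None or score < best[0]:
                best = (score, f, t, left, right)
    if best is None:
        return majority
    _, f, t, left, right = best
    return {"f": f, "t": t,
            "lo": build_tree(left, depth + 1, max_depth),
            "hi": build_tree(right, depth + 1, max_depth)}

def predict(tree, x):
    """Walk down the tree until a leaf label is reached."""
    while isinstance(tree, dict):
        tree = tree["lo"] if x[tree["f"]] < tree["t"] else tree["hi"]
    return tree

def accuracy(tree, rows):
    return sum(predict(tree, x) == y for x, y in rows) / len(rows)

def make_rows(rng, n):
    """Synthetic data: label is 1 when x0 + x1 > 1, with 20% of labels flipped."""
    rows = []
    for _ in range(n):
        x = (rng.random(), rng.random())
        y = int(x[0] + x[1] > 1.0)
        if rng.random() < 0.2:
            y = 1 - y
        rows.append((x, y))
    return rows

rng = random.Random(0)
train, test = make_rows(rng, 200), make_rows(rng, 1000)
scores = {}
for max_depth in (0, 3, 20):   # too simple, moderate, very complex
    tree = build_tree(train, 0, max_depth)
    scores[max_depth] = (accuracy(tree, train), accuracy(tree, test))
    print(f"max_depth={max_depth:2d}  train={scores[max_depth][0]:.2f}  "
          f"test={scores[max_depth][1]:.2f}")
```

With data like this, the depth-0 tree does poorly on both sets (underfitting), while the deepest tree scores much higher on the training records than on the unseen test records, which is the overfitting gap described above.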
Hope it is clear now. Stay tuned for my next post.