Overfitting and Underfitting are two of the most frequently used terms in Machine Learning. In this blog, we are going to examine everything we need to know about these two problems.
Overfitting
Definition
Overfitting is the phenomenon in which your model fits the training data too closely – it even captures the noise of the training data – but fails to predict new data, i.e. testing data.
Let’s look at the first graph above. Note that the red dots are the sample data, while the blue line is our predictive model. You can see that the line changes its slope very frequently – a sign of overfitting, because the model cares too much about the little noise in the samples.
Test for overfitting
The easiest way to know if your model is suffering from overfitting is to compare its error on training data and testing data.
If the error rates are comparable, congratulations! Your model is free from overfitting. However, if your error on the testing set is much higher than on the training set, you have an overfitting problem; check the solutions below to solve it.
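The check above is easy to automate. Here is a minimal sketch (assuming scikit-learn is installed, with a made-up noisy sine dataset and a deliberately over-complex polynomial model) that compares training and testing error:

```python
# Sketch: detect overfitting by comparing training error vs. testing error.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import PolynomialFeatures
from sklearn.pipeline import make_pipeline
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(60, 1))
y = np.sin(X).ravel() + rng.normal(scale=0.3, size=60)  # noisy sine data

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

# A degree-15 polynomial is complex enough to chase the noise.
model = make_pipeline(PolynomialFeatures(degree=15), LinearRegression())
model.fit(X_train, y_train)

train_err = mean_squared_error(y_train, model.predict(X_train))
test_err = mean_squared_error(y_test, model.predict(X_test))
print(f"train MSE: {train_err:.3f}, test MSE: {test_err:.3f}")
# A test error much higher than the training error signals overfitting.
```

With this setup, the test MSE comes out well above the train MSE, which is exactly the symptom described above.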
Question: Just out of curiosity, what if my error on the testing set is significantly lower than on the training set?
Answer: Well, that would be very unusual. I suspect your testing set contains only 1 or 2 samples, right?
A note here: double-check to ensure that the error rates you collected are good representations of the true error rates. For example, if your training or testing set is too small, you will not measure your model's true performance, and hence misjudge how well it fits. Be sure to follow the guidelines for splitting your data into training, validation, and testing sets.
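A common way to do such a split is two calls to scikit-learn's `train_test_split`; the 60/20/20 ratio below is just one conventional choice, not a rule:

```python
# Sketch: a 60/20/20 train/validation/test split with scikit-learn.
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(100).reshape(-1, 1)  # toy data: 100 samples
y = np.arange(100)

# First carve out 20% for the test set...
X_rest, X_test, y_rest, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
# ...then split the remaining 80% as 75/25 to get a 20% validation set overall.
X_train, X_val, y_train, y_val = train_test_split(X_rest, y_rest, test_size=0.25, random_state=0)

print(len(X_train), len(X_val), len(X_test))  # 60 20 20
```

The model is fit on the training set, tuned against the validation set, and the testing set is touched only once at the end.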
Solutions for overfitting
Assume you have realized that your model is suffering from overfitting; here are some methods to rescue it:
Collect more (training) data. Your model seems to be overly complex for your training dataset. By giving it more input data, your model will have more toys to play with, and thus focus less on the undesired little shreds of noise.
Choose a simpler model/algorithm. As the problem is your model being too complex for the training data, one solution is to use a simpler model. E.g., if you are using polynomial regression, give simple linear regression a try.
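To see the effect, here is a small sketch (made-up data with a truly linear trend) comparing an over-complex polynomial model against plain linear regression on held-out data:

```python
# Sketch: a simpler model generalizes better when the true trend is simple.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import PolynomialFeatures
from sklearn.pipeline import make_pipeline
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(1)
X = rng.uniform(0, 1, size=(40, 1))
y = 2 * X.ravel() + rng.normal(scale=0.1, size=40)  # truly linear trend + noise

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.5, random_state=1)

errs = {}
for degree in (15, 1):
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    model.fit(X_tr, y_tr)
    errs[degree] = mean_squared_error(y_te, model.predict(X_te))
    print(f"degree {degree}: test MSE {errs[degree]:.4f}")
# The degree-1 model should beat the degree-15 model on this linear data.
```

On data whose true trend is linear, the degree-15 model chases noise between the training points, while the degree-1 model stays close to the trend.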
Remove some bad features. By removing the features that have little to do with the response variable, you limit the noise that your model can capture. Note that to determine if a feature is good or bad is not so easy, you should take a look at feature selection methods. Also, check out my blogs about feature selection.
Ensemble methods – Bagging and Boosting. Ensemble means you combine (or aggregate) many models together. Bagging is the most common ensemble method – you train many different models with different parameters in parallel, and then combine their predictions to get the final result. For a regression problem, the final result is a simple or weighted mean of the models' predictions. For a classification problem, we usually use a simple or weighted voting scheme to get the final result. Boosting also consists of many models, but trained sequentially, with each later model depending on the previous ones. Boosting focuses on training new models that work well on the cases the previous ones handled badly. Learn more about ensembles in this blog post.
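Both ideas are a few lines in scikit-learn. This is a minimal sketch on a made-up classification dataset; `BaggingClassifier` trains trees in parallel on bootstrap samples, while `AdaBoostClassifier` is one concrete boosting algorithm that trains learners sequentially:

```python
# Sketch: bagging (parallel models + vote) vs. boosting (sequential models).
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier, AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Bagging: 50 trees trained independently on bootstrap samples, majority vote.
bagging = BaggingClassifier(DecisionTreeClassifier(), n_estimators=50, random_state=0)
# Boosting: learners trained one after another, each focusing on prior mistakes.
boosting = AdaBoostClassifier(n_estimators=50, random_state=0)

scores = {}
for name, model in [("bagging", bagging), ("boosting", boosting)]:
    model.fit(X_tr, y_tr)
    scores[name] = model.score(X_te, y_te)
    print(name, scores[name])
```

Note the structural difference: the bagged trees could be trained on 50 machines at once, while the boosted learners must be trained in order.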
Regularization. Regularization is a very good technique for dealing with overfitting. Check out Regularization for Linear regression and Dropout (for Deep Learning).
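As a quick taste, here is a sketch of L2 regularization for linear regression (Ridge), on made-up data; the penalty strength is the `alpha` parameter:

```python
# Sketch: Ridge (L2) regularization shrinks coefficients toward zero.
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge

rng = np.random.default_rng(0)
X = rng.normal(size=(30, 10))          # few samples, many features
y = X[:, 0] + rng.normal(scale=0.5, size=30)

ols = LinearRegression().fit(X, y)     # no penalty
ridge = Ridge(alpha=10.0).fit(X, y)    # L2 penalty with strength alpha

# The penalty pulls the coefficient vector toward zero, reducing variance.
ols_norm = np.linalg.norm(ols.coef_)
ridge_norm = np.linalg.norm(ridge.coef_)
print(f"||coef|| without penalty: {ols_norm:.3f}, with penalty: {ridge_norm:.3f}")
```

Smaller coefficients mean the model reacts less violently to noise in individual features, which is exactly the anti-overfitting effect we want.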
Remove layers (for Deep Learning). Having more layers makes your networks more complex. Hence, reducing the number of layers is a good choice for lessening overfitting.
Early Stopping (for iterative models, e.g. Deep Learning). The more iterations, the more closely your model fits your training data. Normally, we stop iterating when further iterations no longer lower the training error rate. With Early Stopping, we stop a little earlier – that is, when your validation error rate stops decreasing as iterations increase, or when the rate of decrease in training error becomes too small.
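The loop below is a minimal hand-rolled sketch of that rule (deep-learning frameworks ship it ready-made, e.g. as a callback): train one epoch at a time, watch the validation error, and stop after it fails to improve for `patience` epochs. The SGD model and data here are made up for illustration:

```python
# Sketch of early stopping: stop when validation error stops improving.
import numpy as np
from sklearn.linear_model import SGDRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = X @ np.array([1.0, -2.0, 0.5, 0.0, 3.0]) + rng.normal(scale=0.5, size=200)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, random_state=0)

# max_iter=1 + warm_start=True: each .fit() call runs one more epoch.
model = SGDRegressor(max_iter=1, tol=None, warm_start=True, random_state=0)
best_err, patience, bad_rounds = np.inf, 10, 0
for epoch in range(200):
    model.fit(X_tr, y_tr)
    err = mean_squared_error(y_val, model.predict(X_val))
    if err < best_err:
        best_err, bad_rounds = err, 0   # validation error improved: keep going
    else:
        bad_rounds += 1
        if bad_rounds >= patience:      # no improvement for `patience` epochs
            break
print(f"stopped after {epoch + 1} epochs, best validation MSE {best_err:.3f}")
```

In practice you would also keep a copy of the model weights from the best epoch and restore them after stopping.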
Underfitting
Definition
Underfitting happens when your model fails to capture the underlying trends of training data.
Look at the third graph above. When the actual trend of the data is a curve, an underfitting model predicts a straight line.
Test for underfitting
To check whether your model underfits, examine the error rates. For underfitting models, both training error and testing error are high, i.e. performance is low.
Theoretically, if your model cannot give predictions perfectly (or very close to perfectly), then it is suffering from underfitting. But in practice, we know that in many cases giving a 100% correct result is nearly impossible. For example, in the case of recognizing handwritten digits (using the MNIST dataset), a not-so-complicated model can give 98% correct predictions, which is pretty close to perfect. But for cases involving human psychology, e.g. predicting the stock market, getting accuracy significantly higher than a random guess (50%) is pretty hard.
Solution for underfitting
Recheck the preprocessing phase. There may be some errors in your code when you preprocess your data. Maybe you mishandled the NULL (or NaN) values? Did you let a categorical feature take integer (1, 2, 3, etc.) values, making your model think it is numerical? Be sure you did your best to clean and preprocess your data.
Collect more (training) data. Yes! Having more (quality) data can help solve both over- and underfitting. If you throw a coin 10 times and get 6 heads and 4 tails, it is hard to say if the coin is fair or biased. But if you throw it 100 times and get 60 heads and 40 tails, there is quite strong evidence that the coin is biased, right?
Choose a more complex model. As in the third graph above, a straight line (simple model) is not enough to capture the data because the actual trend follows a curve. We have to change our model from a straight line to a curved one (make it more complex) to overcome underfitting. If you are using simple linear regression, changing to polynomial regression may help.
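Here is a small sketch of exactly that switch, on made-up quadratic data: a plain line underfits, while adding polynomial features lets the same linear-regression machinery bend:

```python
# Sketch: a straight line underfits curved data; polynomial features fix it.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures
from sklearn.pipeline import make_pipeline
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(0)
X = np.linspace(-2, 2, 50).reshape(-1, 1)
y = X.ravel() ** 2 + rng.normal(scale=0.1, size=50)  # quadratic trend + noise

line = LinearRegression().fit(X, y)                                  # underfits
curve = make_pipeline(PolynomialFeatures(degree=2), LinearRegression()).fit(X, y)

line_mse = mean_squared_error(y, line.predict(X))
curve_mse = mean_squared_error(y, curve.predict(X))
print(f"line MSE: {line_mse:.3f}, curve MSE: {curve_mse:.3f}")
```

The straight line cannot drop its error below the gap between a line and a parabola, no matter how it is fit; the degree-2 model has the needed shape.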
Add more (good) features. By adding more strong features, you give your model a hand in understanding the data. For example, a linear model cannot know it if your data follows a logarithmic trend of a feature f. You can add a new feature equal to log(f); this can make a big change in your model's performance.
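A quick sketch of that log(f) trick on made-up data (the feature `f` and its logarithmic relationship to y are invented for illustration):

```python
# Sketch: adding an engineered log(f) feature so a linear model can capture
# a logarithmic trend.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

rng = np.random.default_rng(0)
f = rng.uniform(1, 100, size=(200, 1))
y = 3 * np.log(f).ravel() + rng.normal(scale=0.2, size=200)  # log trend in f

raw = LinearRegression().fit(f, y)            # sees only f: underfits the curve
X_aug = np.hstack([f, np.log(f)])             # add the engineered log(f) column
aug = LinearRegression().fit(X_aug, y)        # can now fit the log trend

raw_r2 = r2_score(y, raw.predict(f))
aug_r2 = r2_score(y, aug.predict(X_aug))
print(f"R^2 with f only: {raw_r2:.3f}, with f and log(f): {aug_r2:.3f}")
```

The model is still linear; only the feature space changed, which is why feature engineering is such a cheap cure for underfitting.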
Ensemble methods. Perfect! Ensembling can also help with underfitting. Different models have strengths in capturing different types of underlying trends, and by aggregating them, you get the advantages of all of them. For example, suppose you have a binary classification problem and want to ensemble 10 models, and there is a very hard-to-predict sample x. Among the 10, only 2 capture the trend behind sample x well and give the correct answer. The other 8 effectively flip a coin to guess the output, so 4 output the correct answer and 4 output the wrong answer. In the end, we have 6 correct and 4 wrong predictions for x, and by the voting scheme, the answer given for x is correct. Very good!
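We can put a number on that story with a quick simulation: 2 models that always answer correctly, 8 models that flip a coin, majority vote over all 10. The exact coin-flip split varies trial to trial, so we estimate how often the majority ends up correct:

```python
# Sketch: simulate the voting story (2 reliable models + 8 coin-flippers).
import numpy as np

rng = np.random.default_rng(0)
n_trials = 10_000
correct_majorities = 0
for _ in range(n_trials):
    votes = 2                                   # the 2 good models vote correctly
    votes += rng.integers(0, 2, size=8).sum()   # the 8 weak models flip a coin
    correct_majorities += votes > 5             # majority of 10 needs 6+ correct
ratio = correct_majorities / n_trials
print(f"majority correct in {ratio:.1%} of trials")
```

The majority is correct roughly 64% of the time – better than any of the 8 coin-flippers alone, which is the point of voting.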
A drawback of ensembling is it makes the result less interpretable. If you just need to enhance your model’s performance, ensembling is good. But if you need to explain the result to your boss, think twice before using it.
Reduce Regularization. If you are using regularization with a high rate (λ), consider reducing it. Regularization is a good tool for dealing with overfitting, but it also makes your model prone to underfitting (this is related to the bias-variance tradeoff).
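A short sketch of the effect, using Ridge regression on made-up linear data (in scikit-learn the regularization rate is called `alpha`): cranking the rate up forces the coefficients toward zero and the model underfits; lowering it restores the fit.

```python
# Sketch: too-strong regularization underfits; reducing the rate restores fit.
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
y = X @ np.array([2.0, -1.0, 3.0, 0.5, -2.0]) + rng.normal(scale=0.1, size=100)

errs = {}
for alpha in (1000.0, 1.0):          # very strong vs. mild regularization
    model = Ridge(alpha=alpha).fit(X, y)
    errs[alpha] = mean_squared_error(y, model.predict(X))
    print(f"alpha={alpha}: training MSE {errs[alpha]:.3f}")
```

With `alpha=1000` even the training error is high – the textbook signature of underfitting from over-regularization.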
Add more layers (for Deep Learning). Adding more layers increases the capacity of your model, so it has room to be more complex.
Tune hyperparameters. Maybe the current set of hyperparameters is not so good.
About the Bias-Variance Tradeoff
Above, I touched on this terminology, so I think I should make it clear now.
Bias is a property of predictive models that measures how badly a model captures the trends behind the data.
Variance is another property, measuring how much a model changes in response to a slight change in the data.
Hence, Overfitting is usually described as Low Bias – High Variance, while, on the other hand, Underfitting is High Bias – Low Variance.
Take another look at the solutions for over- and underfitting I have enumerated: you can see that most of the solutions for these two phenomena are opposites. Thus, when we try to reduce overfitting, we often undesirably increase underfitting at the same time, and vice versa. This is called the Bias-Variance Tradeoff.
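These two properties can actually be measured. The rough simulation below (made-up quadratic data) refits a simple and a complex model on many fresh samples and, at a single query point, computes the squared bias (how far the average prediction is from the truth) and the variance (how much predictions jump between refits):

```python
# Rough simulation: simple model = high bias / low variance,
# complex model = low bias / high variance, measured at one point x0.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(0)
x0, true_y0 = 1.5, 1.5 ** 2            # the true function is f(x) = x^2

def predictions_at_x0(degree, n_datasets=200):
    """Refit the model on many fresh noisy samples; collect predictions at x0."""
    preds = []
    for _ in range(n_datasets):
        X = rng.uniform(-2, 2, size=(30, 1))
        y = X.ravel() ** 2 + rng.normal(scale=0.3, size=30)
        m = make_pipeline(PolynomialFeatures(degree), LinearRegression()).fit(X, y)
        preds.append(m.predict(np.array([[x0]]))[0])
    return np.array(preds)

stats = {}
for degree in (1, 9):
    p = predictions_at_x0(degree)
    stats[degree] = ((p.mean() - true_y0) ** 2, p.var())   # (bias^2, variance)
    print(f"degree {degree}: bias^2 {stats[degree][0]:.3f}, variance {stats[degree][1]:.3f}")
```

The straight line misses the curve in the same way on every dataset (bias), while the degree-9 model tracks the curve on average but swings with every resample (variance).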
To sum up,
|  | Overfitting | Underfitting |
| --- | --- | --- |
| Example | First graph above. | Third graph above. |
| Definition | Your model captures not only the actual trends but also the noise of the data. | Your model cannot capture some trends of the data. |
| Indication | Large difference between training error and testing error. | High training error. |
| Solutions | 1. Collect more (training) data. 2. Choose a simpler model/algorithm. 3. Remove some bad features. 4. Ensemble methods. 5. Regularization. 6. Remove layers (for Deep Learning). 7. Early Stopping (for iterative models). | 1. Recheck the preprocessing phase. 2. Collect more (training) data. 3. Choose a more complex model. 4. Add more (good) features. 5. Ensemble methods. 6. Reduce Regularization. 7. Add more layers (for Deep Learning). 8. Tune hyperparameters. |