Over-fitting and Under-fitting are two of the most frequently used terms in Machine Learning. In this blog, we are going to examine everything we need to know about these 2 problems.
Over-fitting happens when your model fits the training data too closely: it even captures the noise in the training data, and as a result fails to predict new (testing) data well.
Let’s look at the first graph above. Note that the red dots are sample data, while the blue line is our predictive model. You can see that the line changes its slope very frequently – that is a sign of over-fitting because it cares too much about the little noise of the samples.
Test for over-fitting
The easiest way to know if your model is suffering from over-fitting is to compare its error on training data and testing data.
If the error rates are comparable, congratulations! Your model is free from over-fitting. However, if the error on the testing set is much higher than on the training set, your model is over-fitting. Check the solutions below to fix this problem.
Question: Just out of curiosity, what if my error on the testing set is significantly lower than on the training set?
Answer: Well, this usually does not make sense. The most likely explanation is that your testing set is too small (perhaps only 1 or 2 samples), so its error is not a reliable estimate.
A note here: double-check that the error rates you collected are good representations of the true error rates. For example, if your training set or testing set is too small, you will not measure the true performance of your model, and you will misjudge how well it fits. Be sure to follow the guideline for splitting your data into training, validation and testing sets.
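The comparison described above can be sketched in a few lines of plain Python. This is only an illustration under assumptions that are not from the original post: the data are synthetic (noisy samples of a sine curve), and the model is a small k-nearest-neighbours regressor written by hand, where `make_data`, `knn_predict` and `mse` are hypothetical helper names. With k = 1 the model memorises the training set, so the training error is exactly zero while the testing error is not, which is the over-fitting signature.

```python
import math
import random

def make_data(n, rng):
    """Noisy samples of y = sin(x) on [0, 6]; the noise level is an arbitrary choice."""
    data = []
    for _ in range(n):
        x = rng.uniform(0.0, 6.0)
        data.append((x, math.sin(x) + rng.gauss(0.0, 0.3)))
    return data

def knn_predict(train, x, k):
    """Average the targets of the k training points closest to x."""
    nearest = sorted(train, key=lambda point: abs(point[0] - x))[:k]
    return sum(y for _, y in nearest) / k

def mse(train, data, k):
    """Mean squared error over `data` of the k-NN model built from `train`."""
    return sum((knn_predict(train, x, k) - y) ** 2 for x, y in data) / len(data)

rng = random.Random(0)          # fixed seed so the run is repeatable
train_set = make_data(60, rng)
test_set = make_data(60, rng)

# k = 1 memorises the training data; a larger k gives a smoother model.
for k in (1, 10):
    train_err = mse(train_set, train_set, k)
    test_err = mse(train_set, test_set, k)
    print(f"k={k:2d}  train MSE={train_err:.3f}  test MSE={test_err:.3f}")
```

For k = 1 the gap between the two errors is large; with a smoother model such as k = 10 one would expect the two numbers to be much closer to each other.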
Solutions for over-fitting
Suppose you have found that your model is suffering from over-fitting. Here are the methods to rescue it:
Collect more (training) data. Your model seems to be over-complex for your training dataset. By giving it more input data, your model will have more toys to play with, and thus less focus on the undesired little shred (noise).
Choose a simpler model/algorithm. As the problem is your model being too complex for the training data, a solution is to replace it with a simpler one. E.g. if you are using polynomial regression, try simple linear regression.
Remove some bad features. By removing the features that have little to do with the response variable, you limit the noise that your model can capture. Note that deciding whether a feature is good or bad is not so easy, so you should take a look at feature selection methods. Also, check out my blogs about feature selection.
Ensemble methods: Bagging and Boosting. Ensemble means you combine (or aggregate) many models together. Bagging is the most common ensemble method: you train many models in parallel, each on a different bootstrap sample (a random sample with replacement) of the training data, and then combine their predictions to get the final result. For a regression problem, the final result is a simple or weighted mean of the predictions from the models. For a classification problem, we usually use a simple or weighted voting scheme to get the final result. Boosting also consists of many models, but they are trained sequentially, with each model depending on the previous ones. Boosting focuses on training new models that work well on the cases that the previous ones handled badly. Learn more about ensemble on this blog post.
Regularization. Regularization is a very good technique for dealing with over-fitting. Check out Regularization for Linear regression and Dropout (for Deep Learning).
Remove layers (for Deep Learning). Having more layers makes your networks more complex. Hence, reducing the number of layers is a good choice for lessening over-fitting.
Early Stopping (for iterative models, e.g. Deep Learning). The more iterations you run, the more closely your model fits your training data. Normally, we stop iterating when more iterations do not lower the training error rate. With Early Stopping, we stop a little bit earlier: when the error rate on held-out data (ideally a validation set rather than the final testing set) no longer decreases as the number of iterations grows, or when the training error is decreasing too slowly.
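To make the Early Stopping rule above concrete, here is a minimal sketch. It assumes you already record one held-out error value per iteration (epoch); the function name `early_stopping_epoch`, the `patience` parameter and the loss values are all made up for illustration and do not come from any particular library.

```python
def early_stopping_epoch(val_losses, patience=3):
    """Return (best_epoch, stop_epoch) for a sequence of held-out losses.

    Training is stopped once the loss has failed to improve on its best
    value for `patience` consecutive epochs.
    """
    best_loss = float("inf")
    best_epoch = -1
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            # New best: remember it and keep training.
            best_loss, best_epoch = loss, epoch
        elif epoch - best_epoch >= patience:
            # No improvement for `patience` epochs: stop here.
            return best_epoch, epoch
    # Never triggered: training ran through every recorded epoch.
    return best_epoch, len(val_losses) - 1

# Hypothetical held-out losses: they improve, then start to rise again.
val_losses = [0.90, 0.70, 0.60, 0.62, 0.61, 0.65, 0.70]
best_epoch, stop_epoch = early_stopping_epoch(val_losses, patience=3)
print(f"best epoch: {best_epoch}, stopped at epoch: {stop_epoch}")
```

In practice you would keep a copy of the model from the best epoch and restore it after stopping, since the epochs trained while waiting out the patience window are the ones that started to over-fit.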
Under-fitting happens when your model fails to capture the underlying trends of training data.
Look at the third graph above. When the actual trend of the data is a curve, an under-fitting model predicts a straight line.
Test for under-fitting
To check if your model under-fits, let’s examine the error rate. For an under-fitting model, both the training error and the testing error are high, i.e. the model performs poorly everywhere.
Theoretically, if your model cannot give perfect (or very close to perfect) predictions, then it is suffering from under-fitting. But in practice, we know that in many cases, giving 100% correct results is nearly impossible. For example, in the case of recognizing hand-written digits (using the MNIST dataset), a not-so-complicated model can reach about 98% accuracy, which is pretty close to perfect. But for cases involving human psychology, e.g. predicting the stock market, getting an accuracy significantly higher than a random guess (50%) is pretty hard.
Solutions for under-fitting
Re-check the pre-processing phase. Maybe there are some errors in the code that pre-processes your data? Maybe you mishandled the NULL (or NaN) values? Maybe you let a categorical feature take integer values (1, 2, 3, etc.), which makes your model think that it is numerical? Be sure you did your best to clean and pre-process your data.
Collect more (training) data. Yes! Having more (quality) data can help solve both over- and under-fitting. If you throw a coin 10 times and get 6 heads and 4 tails, it is hard to say if the coin is fair or biased. But if you throw it 100 times and get 60 heads and 40 tails, the evidence that the coin is biased is much stronger, right?
Choose a more complex model. As in the third graph above, a straight line (simple model) is not enough to capture the data because the actual trend follows a curve. We have to change our model from a straight line to one that can bend (make it more complex) to overcome under-fitting. If you are using simple linear regression, changing to polynomial regression may help.
Add more (good) features. By adding more strong features, you give your model a hand with understanding the data. For example, a linear model cannot discover on its own that your data follows a logarithmic trend in a feature f. You can add a new feature equal to log(f), and this can make a big change in your model’s performance.
Ensemble methods. Perfect! Ensembling can also help with under-fitting. Different models have strengths in capturing different types of underlying trends. By aggregating them, you combine the advantages of the models. For example, you have a binary classification problem and want to ensemble 10 models. There is a very hard-to-predict sample x. Among the 10, only 2 of them capture the trend behind sample x well and give the correct answer. The other 8 just flip a coin to guess the output, so on average 4 of them output the correct answer and 4 output the wrong answer. In that case, we have 6 correct and 4 wrong predictions for x, and by the voting scheme, the answer given for x is correct. Very good!
A drawback of ensembling is it makes the result less interpretable. If you just need to enhance your model’s performance, ensembling is good. But if you need to explain the result to your boss, think twice before using it.
Reduce Regularization. If you are using regularization with a high regularization strength, consider reducing it. Regularization is a good tool to deal with over-fitting, but it also makes your model prone to under-fitting (this is known as the bias-variance trade-off).
Add more layers (for Deep learning). Adding more layers increases the capacity of your model, so it has room to be more complex.
Tune hyperparameters. Maybe the current set of hyperparameters is not so good.
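As a small illustration of the "add more (good) features" idea from the list above, the sketch below fits a straight line twice: once on a raw feature f, and once on the added feature log(f). Everything here is an assumption made for the example: the target is synthetic (exactly 3·log(f) + 1, with no noise), and `fit_line` and `training_mse` are hypothetical helpers implementing ordinary least squares for one input.

```python
import math

def fit_line(xs, ys):
    """Least-squares slope and intercept for y ≈ slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

def training_mse(xs, ys):
    """Fit a line on (xs, ys) and return its mean squared error on the same data."""
    slope, intercept = fit_line(xs, ys)
    return sum((slope * x + intercept - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

# Synthetic data: the target follows a logarithmic trend in the feature f.
f = [float(v) for v in range(1, 51)]
y = [3.0 * math.log(v) + 1.0 for v in f]

# The new feature: log(f).
log_f = [math.log(v) for v in f]

mse_raw = training_mse(f, y)      # line fitted on f: under-fits the curve
mse_log = training_mse(log_f, y)  # line fitted on log(f): captures the trend
print(f"training MSE using f:      {mse_raw:.4f}")
print(f"training MSE using log(f): {mse_log:.2e}")
```

The model is the same straight line in both cases. Only the input changed, and the high training error (the under-fitting indicator) disappears once the feature matches the trend. Real data will be noisy, so the improvement will be less clean than in this toy case.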
About Bias-Variance Trade-off
Above, I touched on this term, so I think I should make it clear now.
Bias is a property of predictive models that measures how badly a model captures the trends behind the data.
Variance is another property, which measures how much a model changes in response to a slight change in the training data.
Hence, Over-fitting is usually referred to as Low Bias – High Variance, while on the other hand, Under-fitting is High Bias – Low Variance.
Take another look at the solutions for Over- and Under-fitting I have enumerated: you can see that most of the solutions for coping with these 2 phenomena are opposites. Thus, when we try to reduce over-fitting, we often undesirably increase under-fitting at the same time, and vice versa. This is called the Bias-Variance Trade-off.
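The two definitions above can be estimated numerically. The sketch below is a rough illustration and not a standard recipe from any library: it draws many training sets from the same synthetic process (y = x² plus Gaussian noise, an arbitrary choice), trains two deliberately extreme models on each, and looks at their predictions at a single point `x0`. The helper names (`predict_mean`, `predict_1nn`, `bias2_and_variance`) are hypothetical.

```python
import random
import statistics

def true_f(x):
    """The underlying trend the models are trying to learn."""
    return x * x

def sample_training_set(rng, n=30, noise_sd=0.3):
    """Draw n noisy observations of true_f on [0, 1]."""
    xs = [rng.uniform(0.0, 1.0) for _ in range(n)]
    return [(x, true_f(x) + rng.gauss(0.0, noise_sd)) for x in xs]

def predict_mean(train, x0):
    """Very simple model: ignore x0 and predict the average target."""
    return statistics.fmean(y for _, y in train)

def predict_1nn(train, x0):
    """Very flexible model: copy the target of the nearest training point."""
    return min(train, key=lambda point: abs(point[0] - x0))[1]

def bias2_and_variance(predict, x0, rng, trials=500):
    """Squared bias and variance of the prediction at x0 across training sets."""
    preds = [predict(sample_training_set(rng), x0) for _ in range(trials)]
    bias = statistics.fmean(preds) - true_f(x0)
    return bias ** 2, statistics.pvariance(preds)

rng = random.Random(0)   # fixed seed so the run is repeatable
x0 = 0.9
bias2_mean, var_mean = bias2_and_variance(predict_mean, x0, rng)
bias2_1nn, var_1nn = bias2_and_variance(predict_1nn, x0, rng)
print(f"mean model: bias^2={bias2_mean:.4f}  variance={var_mean:.4f}")
print(f"1-NN model: bias^2={bias2_1nn:.4f}  variance={var_1nn:.4f}")
```

The constant-mean model behaves like an under-fitting model (high bias, low variance), while the 1-nearest-neighbour model behaves like an over-fitting one (low bias, high variance), which matches the pairing described above.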
To sum up,
| |Over-fitting|Under-fitting|
|---|---|---|
|Definition|Your model captures not only actual trends but also the noise of the data.|Your model cannot capture some trends of data.|
|Indication|A large difference between training error and testing error.|High training error.|
|Solutions|1. Collect more (training) data.|1. Re-check the pre-processing phase.|
| |2. Choose a simpler model/algorithm.|2. Collect more (training) data.|
| |3. Remove some bad features.|3. Choose a more complex model.|
| |4. Ensemble methods.|4. Add more (good) features.|
| |5. Regularization.|5. Ensemble methods.|
| |6. Remove layers (for Deep Learning).|6. Reduce Regularization.|
| |7. Early Stopping (for iterative models).|7. Add more layers (for Deep learning).|
| | |8. Tune hyperparameters.|