|Test your knowledge|
What is (Multi)Collinearity?
To better understand the definition of collinearity, let’s start with an example.
Here we have a dataset from an ice cream store. The dataset is collected over 4 days, with each row corresponds to 1 day. We have 3 predictor variables, those are: the temperature in Celsius, the temperature in Fahrenheit and humidity; along with 1 respond variable: ice cream sales.
|Temperature (Celsius)||Temperature (Fahrenheit)||Humidity (%)||Ice-cream sales|
There is something strange about this data: we have 2 fields that contain the same information (Temperature), just on 2 different scales (Celsius and Fahrenheit).
Well, that’s called Perfect Collinearity.
When we have 2 fields and the Pearson correlation between these 2 fields is exactly 1, we say that the dataset suffers from Perfect collinearity.
When we have 3 fields and 1 of them can be formed from a linear combination of the other 2, we also call that Perfect collinearity (e.g. imagine we have 3 predictors and such that ).
The same goes for cases when we have n fields.
Now we know what perfect collinearity is, it is not so hard to guess the definition of high collinearity. In the case of 2 fields, high collinearity happens when the Pearson correlation of these 2 fields is high. In the case of n fields, high collinearity happens when 1 field can be approximated by a linear combination of the other (n-1) fields.
When we say a dataset is suffering from collinearity, we usually mean it has perfect collinearity or high collinearity.
What is bad about collinearity?
The above dataset seems simple, maybe a simple model will work well on this data, sp let’s apply a Linear regressor on it. The result might be something like this:
where is the predicted ice cream sales, is Temperature in Celsius, is Temperature in Fahrenheit and lastly, is humidity.
This linear regression model fits ideally with our data. The error, as you can see, is 0 (try to verify it by yourself!).
Okay, above regression equation is good, but there are other equations that can produce equally goodness, for example:
You can see that the error of this model is also 0.
This gives us a big concern about evaluating the importance of predictor variables. In the first model, we have that is very important (its weight is 20 when ‘s weight is 0 and ‘s weight is 3). But in the second model, is overwhelming.
We may be misled by the models. If we look at model 1, we will see that: Oh, temperature (in Fahrenheit) does not play any role in predicting ice cream sales. This is obviously false. Same goes with model 2, we may look at it and think that temperature (in Celsius) is not so important. Different models make us think differently, and it seems like both our thoughts about the importance of features are false.
So we know what is bad about collinearity. Above, I take an extreme example of perfect collinearity, but the sense also applies to high collinearity. Collinearity makes our predictive model volatile, which, in turn, misleads us of the feature importance.
Note that, surprisingly, collinearity does NOT affect the performance of most models. As we can see in the above example, the error rates of both models are 0. A rare exception when multicollinearity worsens a model is in the case of Naive Bayes models, which require predictors to be independent of each other.
How to avoid collinearity?
(Multi)Collinearity is the correlation of 1 predictor variable with a linear combination of 1 other predictor variables.
It is bad because it can mislead us about the importance of variables.
We can fix the problem by eliminating high-correlation columns, using dimension reduction techniques, VIF, etc. However, be noted that each method comes with its own drawbacks.
1. I only care about the performance of my predictive model, should I pay attention to collinearity?
2. Perfect collinearity, high collinearity, low collinearity and no collinearity, which of them are completely good?
3. When a column can be formed by a non-linear combination of other columns (e.g. ), is this called collinearity and is it bad?
1. Probably NO. As I said, the bad thing about collinearity is it misleads us when we try to infer which predictor variables are good and which are bad.
2. Low collinearity and no collinearity are very good. You don’t need to do anything if this is the case. Perfect and high collinearity is what we should be concerned about.
3. Take a look at the name: collinearity. If the combination is non-linear, we don’t call it collinearity. However, multiplying 2 variables to create a new variable is a common technique to introduce nonlinearity into Linear Regression and in fact, the correlation between the new variable and each of the component variables is usually high. So, be careful.
|Test your understanding|