What is Multicollinearity (or Collinearity)?

What is (Multi)Collinearity?

To better understand the definition of collinearity, let’s start with an example.

Here we have a dataset from an ice cream store. The dataset was collected over 4 days, with each row corresponding to 1 day. We have 3 predictor variables: the temperature in Celsius, the temperature in Fahrenheit, and the humidity; along with 1 response variable: ice cream sales.

Temperature (Celsius)	Temperature (Fahrenheit)	Humidity (%)	Ice-cream sales
10	50	50	150
18	64.4	80	220
24	75.2	70	370
18	64.4	60	280

There is something strange about this data: we have 2 fields that contain the same information (Temperature), just on 2 different scales (Celsius and Fahrenheit).
Well, that’s called Perfect Collinearity.

When we have 2 fields and the Pearson correlation between these 2 fields is exactly 1 (or exactly -1), we say that the dataset suffers from Perfect collinearity.
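To make this concrete, here is a minimal sketch that computes the Pearson correlation for the columns of the ice cream table above. The helper `pearson` is written here for illustration (it is not from any library), and it uses only the Python standard library.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Columns copied from the ice cream table above.
celsius = [10, 18, 24, 18]
fahrenheit = [50, 64.4, 75.2, 64.4]
humidity = [50, 80, 70, 60]

r_temperatures = pearson(celsius, fahrenheit)
r_celsius_humidity = pearson(celsius, humidity)

print(f"Celsius vs Fahrenheit: {r_temperatures:.4f}")
print(f"Celsius vs humidity:   {r_celsius_humidity:.4f}")
```

The two temperature columns should come out with a correlation of 1 (up to floating-point rounding), while Celsius and humidity are correlated but clearly below 1.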

When we have 3 fields and 1 of them can be formed from a linear combination of the other 2, we also call that Perfect collinearity (e.g. imagine we have 3 predictors $x_1, x_2$ and $x_3$ such that $x_3 = 2x_1 - 5x_2$ ).

The same goes for cases when we have n fields.
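
One way to write the general case compactly (the coefficients $a_0, a_1, \dots, a_n$ are notation introduced here for illustration): the fields $x_1, \dots, x_n$ are perfectly collinear when there are coefficients, with at least one of $a_1, \dots, a_n$ non-zero, such that for every row of the dataset

$a_0 + a_1 x_1 + a_2 x_2 + \dots + a_n x_n = 0$

In the ice cream data, writing $c$ for the Celsius column and $f$ for the Fahrenheit column, this relation is $32 + 1.8c - f = 0$.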

Now that we know what perfect collinearity is, it is not hard to guess the definition of high collinearity. In the case of 2 fields, high collinearity happens when the Pearson correlation of these 2 fields is close to 1 or -1. In the case of n fields, high collinearity happens when 1 field can be closely approximated by a linear combination of the other (n-1) fields.

When we say a dataset is suffering from collinearity, we usually mean it has perfect collinearity or high collinearity.

What is bad about collinearity?

The above dataset seems simple, and maybe a simple model will work well on it, so let’s fit a linear regression to it. The result might be something like this:

$y' = 20x_1 + 0x_2 -3x_3 + 100$

where $y'$ is the predicted ice cream sales, $x_1$ is Temperature in Celsius, $x_2$ is Temperature in Fahrenheit and lastly, $x_3$ is humidity.

This linear regression model fits our data perfectly. The error, as you can see, is 0 on every row (try to verify it yourself!).
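If you would rather let the computer do the check, here is a small sketch that plugs each row of the table into the equation above and subtracts the actual sales. The names `rows`, `model_1` and `errors_1` are made up for this example.

```python
# Rows copied from the ice cream table: (celsius, fahrenheit, humidity, sales).
rows = [
    (10, 50.0, 50, 150),
    (18, 64.4, 80, 220),
    (24, 75.2, 70, 370),
    (18, 64.4, 60, 280),
]


def model_1(celsius, fahrenheit, humidity):
    """The regression equation above: y' = 20*x1 + 0*x2 - 3*x3 + 100."""
    return 20 * celsius + 0 * fahrenheit - 3 * humidity + 100


# Prediction minus actual sales, one value per day.
errors_1 = [model_1(c, f, h) - sales for c, f, h, sales in rows]
print(errors_1)
```

Every entry of `errors_1` should be 0, which is what "the error is 0" means here.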

Okay, the above regression equation is good, but there are other equations that fit equally well, for example:

$\begin{aligned}y' &= x_1 + 19 \cdot \frac{x_2-32}{1.8} - 3x_3 + 100 \\ &= x_1 + \frac{19}{1.8} x_2 - 3x_3 - \frac{2140}{9}\end{aligned}$

You can see that the error of this model is also 0.
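In fact there is a whole family of such equations. Because $x_1 = \frac{x_2 - 32}{1.8}$ on every row, we can move any amount of weight from the Celsius term to the converted Fahrenheit term without changing a single prediction. The sketch below checks this for a few values of a shift parameter `t`, which is a name introduced only for this illustration: `t = 0` gives the first model, and `t = 19` gives the second model, read as $y' = x_1 + 19 \cdot \frac{x_2 - 32}{1.8} - 3x_3 + 100$.

```python
# Rows copied from the ice cream table: (celsius, fahrenheit, humidity, sales).
rows = [
    (10, 50.0, 50, 150),
    (18, 64.4, 80, 220),
    (24, 75.2, 70, 370),
    (18, 64.4, 60, 280),
]


def blended_model(t, celsius, fahrenheit, humidity):
    """Move t units of weight from Celsius onto the converted Fahrenheit term."""
    celsius_from_fahrenheit = (fahrenheit - 32) / 1.8
    return (20 - t) * celsius + t * celsius_from_fahrenheit - 3 * humidity + 100


def max_abs_error(t):
    """Largest absolute prediction error over the 4 days for a given t."""
    return max(abs(blended_model(t, c, f, h) - sales) for c, f, h, sales in rows)


# t = 0 is the first model, t = 19 is the second; the other values are arbitrary.
worst_errors = {t: max_abs_error(t) for t in (0, 19, 20, -40, 7.5)}
for t, err in worst_errors.items():
    print(f"t = {t:>5}: max |error| = {err:.2e}")
```

All of these models should have an error of 0 up to floating-point rounding, even though their coefficients for the two temperature columns are wildly different.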

This raises a big concern about evaluating the importance of predictor variables. In the first model, $x_1$ looks very important (its weight is 20, while $x_2$’s weight is 0 and $x_3$’s weight is -3). But in the second model, $x_2$ dominates: its weight is $\frac{19}{1.8} \approx 10.6$, while $x_1$’s weight is only 1.

We may be misled by the models. If we look at model 1, we might conclude: oh, temperature (in Fahrenheit) does not play any role in predicting ice cream sales. This is obviously false. The same goes for model 2: we may look at it and think that temperature (in Celsius) is not so important. Different models lead us to different conclusions about the importance of features, and here both conclusions are wrong.

So now we know what is bad about collinearity. Above, I took an extreme example of perfect collinearity, but the same reasoning applies to high collinearity. Collinearity makes the coefficients of our predictive model unstable, which, in turn, misleads us about feature importance.

Note that, surprisingly, collinearity does NOT affect the predictive performance of most models. As we can see in the above example, the errors of both models are 0. A rare exception where multicollinearity worsens a model is Naive Bayes, which assumes the predictors are conditionally independent of each other given the class.

How to avoid collinearity?

Eliminate high-correlation columns. We can iterate through all pairs of columns and check the Pearson correlation between them. Highly (or perhaps very highly) correlated columns should be taken care of, for example by keeping only one column of each such pair. A drawback of this approach is that we can only test the correlation of each pair of columns; we cannot detect the case where one column is correlated with a linear combination of $\geq$ 2 other columns (e.g. $x_1 = 2x_2 - 7x_3$ ). The good news is that those cases are much less likely to happen than the cases where 2 predictor variables are highly correlated with each other.
Use dimension reduction techniques. PCA is often the choice. The bad thing about this approach is that when we do dimension reduction, we also lose much of the ability to interpret the data, because each new component is a mix of the original columns.
Use Variance Inflation Factor (VIF). This is a technique that was specifically created for dealing with multicollinearity. Unlike pairwise correlation, where you can only see the relation of 1 column versus another, VIF can help us detect multicollinearity in a 1-versus-all manner. What VIF does is basically to compute, for each predictor, how precisely it can be predicted by a linear combination of the other predictors. That is, we regress a predictor on all the other predictors and get the $R^2$ ; the VIF of that predictor is then $\frac{1}{1 - R^2}$ . The higher the $R^2$ (and hence the VIF), the higher the level of multicollinearity this predictor has.
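
To illustrate the 1-versus-all idea, here is a minimal VIF sketch in plain Python, using the common definition $\text{VIF}_j = \frac{1}{1 - R_j^2}$, where $R_j^2$ comes from regressing predictor $j$ on all the other predictors. Everything in it is an assumption made for this example: the helper names (`solve_linear_system`, `r_squared`, `vif`) are not from any library, and the data is made up so that `x3` is roughly `x1 + x2` while `x1` and `x2` are almost uncorrelated with each other. In practice you would more likely use an existing statistics package.

```python
def solve_linear_system(a, b):
    """Solve a @ x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    # Augmented matrix [a | b], copied so the inputs are left untouched.
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - tail) / m[r][r]
    return x


def r_squared(target, others):
    """R^2 from regressing `target` on the columns in `others` plus an intercept.

    Assumes the columns in `others` are not perfectly collinear themselves.
    """
    n = len(target)
    rows = [[1.0] + [col[i] for col in others] for i in range(n)]
    k = len(rows[0])
    # Normal equations: (X^T X) beta = X^T y.
    xtx = [[sum(row[p] * row[q] for row in rows) for q in range(k)] for p in range(k)]
    xty = [sum(row[p] * t for row, t in zip(rows, target)) for p in range(k)]
    beta = solve_linear_system(xtx, xty)
    fitted = [sum(b * v for b, v in zip(beta, row)) for row in rows]
    mean = sum(target) / n
    ss_res = sum((t - f) ** 2 for t, f in zip(target, fitted))
    ss_tot = sum((t - mean) ** 2 for t in target)
    return 1.0 - ss_res / ss_tot


def vif(columns):
    """VIF of each column, regressing it on all the other columns."""
    result = []
    for j, col in enumerate(columns):
        others = columns[:j] + columns[j + 1:]
        r2 = r_squared(col, others)
        result.append(float("inf") if r2 >= 1.0 else 1.0 / (1.0 - r2))
    return result


# Made-up data: x3 is x1 + x2 plus a little noise.
x1 = [1, 2, 3, 4, 5, 6, 7, 8]
x2 = [5, 2, 7, 1, 8, 3, 6, 4]
x3 = [6.1, 3.9, 10.2, 4.8, 13.1, 8.9, 13.2, 11.8]

vifs = vif([x1, x2, x3])
print([round(v, 1) for v in vifs])
```

All three VIFs should come out very large, even though `x1` and `x2` are barely correlated with each other, so a purely pairwise check on those two columns would not have raised a flag. A common rule of thumb treats a VIF above roughly 5 or 10 as a warning sign, but the exact threshold is a judgment call.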

Summary

(Multi)Collinearity is the correlation of 1 predictor variable with a linear combination of $\geq$ 1 other predictor variables.

It is bad because it can mislead us about the importance of variables.

We can fix the problem by eliminating high-correlation columns, using dimension reduction techniques, VIF, etc. However, note that each method comes with its own drawbacks.

Questions:

1. I only care about the performance of my predictive model, should I pay attention to collinearity?

2. Perfect collinearity, high collinearity, low collinearity and no collinearity: which of them are nothing to worry about?

3. When a column can be formed by a non-linear combination of other columns (e.g. $x_1 = x_2*x_3$ ), is this called collinearity and is it bad?

Answers:

1. Probably not. As I said, collinearity rarely hurts predictive performance; the bad thing about it is that it misleads us when we try to infer which predictor variables are important and which are not.

2. Low collinearity and no collinearity are very good. You don’t need to do anything if this is the case. Perfect and high collinearity are what we should be concerned about.

3. Take a look at the name: collinearity. If the combination is non-linear, we don’t call it collinearity. However, multiplying 2 variables to create a new variable is a common technique to introduce nonlinearity into Linear Regression and in fact, the correlation between the new variable and each of the component variables is usually high. So, be careful.
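
As a rough illustration of that last point, the sketch below builds a product column from two made-up, nearly uncorrelated positive columns and measures its Pearson correlation with each of them. The data and the names (`x2`, `x3`, `product`, `pearson`) are assumptions for this example only.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Made-up positive columns that are almost uncorrelated with each other.
x2 = [1, 2, 3, 4, 5, 6]
x3 = [4, 1, 6, 2, 5, 3]

# The interaction feature: a non-linear combination of x2 and x3.
product = [a * b for a, b in zip(x2, x3)]

r_components = pearson(x2, x3)
r_product_x2 = pearson(product, x2)
r_product_x3 = pearson(product, x3)

print(f"x2 vs x3:       {r_components:.3f}")
print(f"x2*x3 vs x2:    {r_product_x2:.3f}")
print(f"x2*x3 vs x3:    {r_product_x3:.3f}")
```

On this toy data the product column is noticeably correlated with both of its components, even though the components themselves are close to uncorrelated. How strong the effect is depends on the data, for instance on whether the columns have been centered before multiplying.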


Tung M Phung's Blog
