What is Multicollinearity (or Collinearity)?

Test your knowledge

What is Multicollinearity (or Collinearity)? - Quiz 1

1. Does multicollinearity ruin the performance of linear regression?
2. What best describes perfect collinearity?
3. Are collinearity and multicollinearity the same?


What is (Multi)Collinearity?

To better understand the definition of collinearity, let’s start with an example.

Here we have a dataset from an ice cream store. The dataset was collected over 4 days, with each row corresponding to 1 day. We have 3 predictor variables: the temperature in Celsius, the temperature in Fahrenheit, and the humidity; along with 1 response variable: ice cream sales.

Temperature (Celsius) | Temperature (Fahrenheit) | Humidity (%) | Ice-cream sales
10                    | 50                       | 50           | 150
18                    | 64.4                     | 80           | 220
24                    | 75.2                     | 70           | 370
18                    | 64.4                     | 60           | 280

There is something strange about this data: we have 2 fields that contain the same information (Temperature), just on 2 different scales (Celsius and Fahrenheit).
Well, that’s called Perfect Collinearity.

When we have 2 fields and the Pearson correlation between these 2 fields is exactly 1 (or -1), we say that the dataset suffers from Perfect collinearity.
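For example, here is a minimal sketch (in Python with numpy, using the column values copied from the table above) that confirms this for the two temperature columns:

```python
import numpy as np

# Column values copied from the ice cream table above
temp_c = np.array([10, 18, 24, 18], dtype=float)   # Temperature (Celsius)
temp_f = np.array([50, 64.4, 75.2, 64.4])          # Temperature (Fahrenheit)

# Fahrenheit = 1.8 * Celsius + 32 is an exact linear transform,
# so the Pearson correlation between the two columns is 1: perfect collinearity.
r = np.corrcoef(temp_c, temp_f)[0, 1]
print(r)  # 1.0 (up to floating-point rounding)
```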

When we have 3 fields and 1 of them can be formed from a linear combination of the other 2, we also call that Perfect collinearity (e.g. imagine we have 3 predictors x_1, x_2 and x_3 such that x_3 = 2x_1 - 5x_2).

The same goes for cases when we have n fields.

Now that we know what perfect collinearity is, it is not hard to guess the definition of high collinearity. In the case of 2 fields, high collinearity happens when the Pearson correlation of these 2 fields is high. In the case of n fields, high collinearity happens when 1 field can be approximated by a linear combination of the other (n-1) fields.

When we say a dataset is suffering from collinearity, we usually mean it has perfect collinearity or high collinearity.

What is bad about collinearity?

The above dataset seems simple; maybe a simple model will work well on this data, so let's apply a Linear regressor to it. The result might be something like this:

y' = 20x_1 + 0x_2 -3x_3 + 100

where y' is the predicted ice cream sales, x_1 is Temperature in Celsius, x_2 is Temperature in Fahrenheit and lastly, x_3 is humidity.

This linear regression model fits our data perfectly. The error, as you can see, is 0 (try to verify it by yourself!).

Okay, the above regression equation is good, but there are other equations that fit the data equally well, for example:

\begin{aligned}y' &= x_1 + 19 \cdot \frac{x_2-32}{1.8} - 3x_3 + 100 \\    &= x_1 + \frac{19}{1.8} x_2 - 3x_3 - \frac{2140}{9}\end{aligned}

You can see that the error of this model is also 0.
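If you want to check both models yourself, here is a small sketch (in Python with numpy, plugging the table values into each equation; the variable names are just for illustration):

```python
import numpy as np

temp_c   = np.array([10, 18, 24, 18], dtype=float)     # x_1
temp_f   = np.array([50, 64.4, 75.2, 64.4])            # x_2
humidity = np.array([50, 80, 70, 60], dtype=float)     # x_3
sales    = np.array([150, 220, 370, 280], dtype=float) # y

# Model 1: y' = 20*x_1 + 0*x_2 - 3*x_3 + 100
pred1 = 20 * temp_c + 0 * temp_f - 3 * humidity + 100

# Model 2: y' = x_1 + (19/1.8)*x_2 - 3*x_3 - 2140/9
pred2 = temp_c + (19 / 1.8) * temp_f - 3 * humidity - 2140 / 9

print(np.abs(pred1 - sales).max())  # 0.0
print(np.abs(pred2 - sales).max())  # ~0.0 (floating-point rounding)
```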

This gives us a big concern about evaluating the importance of predictor variables. In the first model, x_1 is very important (its weight is 20, while x_2's weight is 0 and x_3's weight is only -3). But in the second model, x_2 dominates instead.

We may be misled by the models. Looking at model 1, we might conclude that temperature (in Fahrenheit) does not play any role in predicting ice cream sales, which is obviously false. The same goes for model 2: looking at it, we might think that temperature (in Celsius) is not so important. Different models lead us to different conclusions, and both conclusions about the importance of the features are wrong.

So now we know what is bad about collinearity. Above, I took an extreme example of perfect collinearity, but the same reasoning applies to high collinearity. Collinearity makes our predictive model volatile, which, in turn, misleads us about feature importance.

Note that, surprisingly, collinearity does NOT affect the performance of most models. As we can see in the above example, the errors of both models are 0. A rare exception where multicollinearity worsens a model is Naive Bayes, which assumes that predictors are (conditionally) independent of each other.

How to avoid collinearity?

  • Eliminate high-correlation columns. We can iterate through all columns and check the Pearson correlation between each pair of them. Highly correlated columns should be taken care of. A drawback of this approach is that we can only test the correlation of each pair of columns; we cannot detect the case when one column is correlated with a linear combination of \geq 2 other columns (e.g. x_1 = 2x_2 - 7x_3). The good news is that those cases are much less likely to happen than cases where 2 predictor variables are highly correlated with each other.
  • Use dimension reduction techniques. PCA is often the choice. The bad thing about this approach is that when we do dimension reduction, we also lose the ability to interpret the data.
  • Use the Variance Inflation Factor (VIF). This is a technique that was specifically created for dealing with multicollinearity. Unlike correlation, which only shows the relation of 1 column versus another, VIF helps us detect multicollinearity in a 1-versus-all manner. What VIF does is basically to compute, for each predictor, how precisely it can be predicted using a linear combination of the other predictors. That is, we regress a predictor on all other predictors, get the R^2, and report VIF = 1/(1 - R^2). The higher the R^2 (and thus the VIF), the higher the level of multicollinearity this predictor has. A small sketch of the first and third bullets follows this list.
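Here is a rough sketch of those two checks in Python (assuming pandas and scikit-learn are available; the vif helper below is just for illustration, not a standard API): compute the pairwise correlation matrix, then regress each predictor on the others and report 1/(1 - R^2).

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# The three predictors from the ice cream table above
X = pd.DataFrame({
    "temp_c":   [10.0, 18.0, 24.0, 18.0],
    "temp_f":   [50.0, 64.4, 75.2, 64.4],
    "humidity": [50.0, 80.0, 70.0, 60.0],
})

# Check 1: pairwise Pearson correlations (only catches 2-column relationships)
print(X.corr())

# Check 2: VIF of each predictor, computed by regressing it on the others
def vif(X: pd.DataFrame) -> pd.Series:
    out = {}
    for col in X.columns:
        others = X.drop(columns=col)
        r2 = LinearRegression().fit(others, X[col]).score(others, X[col])
        out[col] = np.inf if np.isclose(r2, 1.0) else 1.0 / (1.0 - r2)
    return pd.Series(out)

print(vif(X))
# temp_c and temp_f get huge (here effectively infinite) VIFs, flagging the
# collinearity; a common rule of thumb is to investigate anything above 5-10.
```

On this tiny dataset, the two temperature columns come out perfectly correlated and get effectively infinite VIFs, while humidity stays modest.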

Summary

(Multi)Collinearity is the correlation of 1 predictor variable with a linear combination of \geq 1 other predictor variables.

It is bad because it can mislead us about the importance of variables.

We can fix the problem by eliminating high-correlation columns, using dimension reduction techniques, VIF, etc. However, note that each method comes with its own drawbacks.


Questions:

1. I only care about the performance of my predictive model, should I pay attention to collinearity?

2. Perfect collinearity, high collinearity, low collinearity and no collinearity, which of them are completely good?

3. When a column can be formed by a non-linear combination of other columns (e.g. x_1 = x_2*x_3), is this called collinearity and is it bad?

Answers:

1. Probably NO. As I said, the bad thing about collinearity is that it misleads us when we try to infer which predictor variables are good and which are bad; it usually does not hurt predictive performance.

2. Low collinearity and no collinearity are very good. You don't need to do anything if this is the case. Perfect and high collinearity are what we should be concerned about.

3. Take a look at the name: collinearity. If the combination is non-linear, we don’t call it collinearity. However, multiplying 2 variables to create a new variable is a common technique to introduce nonlinearity into Linear Regression and in fact, the correlation between the new variable and each of the component variables is usually high. So, be careful.
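As a quick illustration of that last point, here is a hedged sketch using synthetic data (uniformly distributed positive values, not any real dataset):

```python
import numpy as np

rng = np.random.default_rng(0)
x2 = rng.uniform(1, 10, size=1000)   # two independent positive predictors
x3 = rng.uniform(1, 10, size=1000)
x1 = x2 * x3                         # a non-linear (multiplicative) combination

# Even though x1 is not a *linear* combination of x2 and x3,
# it is still strongly correlated with each of them.
print(np.corrcoef(x1, x2)[0, 1])  # roughly 0.65-0.7
print(np.corrcoef(x1, x3)[0, 1])  # roughly 0.65-0.7
```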

Test your understanding

What is Multicollinearity (or Collinearity)? - Quiz 2

1. How can we deal with the problem of collinearity? Choose all that apply.
2. Is a higher-order correlation between 2 variables considered high collinearity?
3. What best describes the Variance Inflation Factor (VIF)? Choose all that apply.
4. Given a dataset about big companies in the US. Among the variables, there are 2 of interest: one is the Total Assets reported by Bloomberg, the other is also the Total Assets but given by Reuters. We know that Reuters and Bloomberg are both reputable sources of information. Which statement is true?
5. When we use a decision tree as our predictive model, how can it be affected by high multicollinearity?


