When reading through others’ notebooks, I see that many people tend to center, scale or normalize their features right before applying a predictive model to the data. Transforming the features that way is unlikely to worsen a model’s performance, but is it truly helpful?
In this blog post, I will list the cases in which feature centering, scaling and normalizing are really beneficial.
Those cases are:
1. Taking the sum (or average) of some predictor variables
When doing feature engineering, sometimes we should extract a new feature based on the existing ones, and the most common way is to take the sum or average of some features.
For example, to predict house prices, you would like to take the sum of the area of all floors. Currently, every floor from the first floor upward is measured in m², while the basement is measured by how many cars it can contain. It makes no sense to sum these values while their units of measure differ. Hence, we should first scale the basement’s value so that it is also expressed in m². Assuming the space needed to park a car is a rectangle with a width of 3.2 m and a length of 6.2 m, the area it takes is 3.2 * 6.2 = 19.84 m². Thus, we re-scale the basement’s value by multiplying it by 19.84, so that its unit is m² instead of the number of cars.
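The conversion above can be sketched in a few lines of Python. This is only an illustration: the function name `total_area_m2` and the sample house are made up here, and the 3.2 m by 6.2 m parking space is the assumption from the example.

```python
# Assumed footprint of one parking space, taken from the example above.
CAR_SPACE_M2 = 3.2 * 6.2  # about 19.84 m² per car


def total_area_m2(floor_areas_m2, basement_cars):
    """Sum the floor areas (m²) with the basement capacity converted from cars to m²."""
    basement_m2 = basement_cars * CAR_SPACE_M2
    return sum(floor_areas_m2) + basement_m2


# Hypothetical house: two floors of 80 m² and 75 m², basement fits 2 cars.
area = total_area_m2([80.0, 75.0], 2)
print(round(area, 2))  # 194.68
```

Without the conversion, the sum would be 80 + 75 + 2 = 157, a number that mixes m² and car counts and has no meaningful unit.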
The key takeaway is: when we combine several variables into one, be careful about the units of those variables.
2. Multiplying variables
When you think your predictor X1 has a curvilinear relationship with your response variable, you can, for example, create a new predictor X1_2 whose values are just the square of X1. While this new predictor can contribute more predictive power, it also makes the data suffer from high multicollinearity, because X1 and X1_2 are strongly correlated. The same problem occurs when you derive a new feature by multiplying 2 different predictors, say, X3 = X1*X2.
To solve this problem, we can center the variables by subtracting their means (or, if you prefer, their medians). This demeaning step should be executed before we multiply the variables.
Notice that after demeaning, the new variable (say, X3) has a very different meaning compared to its version without demeaning: it now measures how far the observations are from the center rather than how large they are. That is why it is much less correlated with the original predictors and the problem of high multicollinearity is reduced.
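Here is a small sketch of the effect, using only the Python standard library. The data (the integers 1 to 10) and the helper `pearson` are made up for illustration. Because these values are symmetric around their mean, the correlation drops all the way to zero after centering; with skewed data it would shrink but not vanish.

```python
from statistics import mean


def pearson(a, b):
    """Pearson correlation coefficient between two equally long lists."""
    ma, mb = mean(a), mean(b)
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = sum((x - ma) ** 2 for x in a)
    vb = sum((y - mb) ** 2 for y in b)
    return cov / (va * vb) ** 0.5


x1 = [float(v) for v in range(1, 11)]  # hypothetical predictor: 1, 2, ..., 10

# Squared term built from the raw predictor.
raw_sq = [v ** 2 for v in x1]

# Squared term built after centering the predictor.
m = mean(x1)
centered = [v - m for v in x1]
centered_sq = [v ** 2 for v in centered]

r_raw = pearson(x1, raw_sq)            # close to 1: strong multicollinearity
r_centered = pearson(x1, centered_sq)  # 0 here, since the data are symmetric

print(r_raw, r_centered)
```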
3. Predictive models that have feature-interaction in nature
Some models, like Linear Regression (without regularization) and Decision Tree, produce the same predictions no matter how each predictor is shifted or rescaled, so centering or scaling the predictors is not required when using these algorithms. However, some others, like K-means and K-nearest Neighbor, combine all the predictors into a single number. For these 2 models, a “distance” is computed for each pair of sample points by accumulating the differences in each predictor value, usually with the Euclidean distance (or the Hamming distance for categorical predictors). Imagine your variable X1 has a small value range (e.g. [0, 10]) and X2 has a much bigger range (e.g. [100, 1000000]): the distance between 2 sample points will depend almost entirely on X2, which is not fair to poor X1. A simple solution is to scale all the variables into the same range (usually [0, 1]), so the algorithms treat them more equally.
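To make this concrete, here is a minimal sketch with four made-up sample points whose X1 lies in [0, 10] and whose X2 lies in [100, 1000000]. The helpers `min_max_scale` and `euclidean` are written for this illustration only.

```python
import math


def min_max_scale(column):
    """Rescale a list of numbers linearly into the range [0, 1]."""
    lo, hi = min(column), max(column)
    return [(v - lo) / (hi - lo) for v in column]


def euclidean(p, q):
    """Euclidean distance between two points given as tuples."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


# Hypothetical samples A, B, C, D as (X1, X2).
x1 = [0.0, 10.0, 5.0, 1.0]
x2 = [100.0, 5100.0, 1_000_000.0, 20_100.0]

raw = list(zip(x1, x2))
scaled = list(zip(min_max_scale(x1), min_max_scale(x2)))

A, B, D = 0, 1, 3  # indices of the points we compare

# Raw distances: X2 dominates, so B looks closer to A than D does.
raw_ab = euclidean(raw[A], raw[B])
raw_ad = euclidean(raw[A], raw[D])

# Scaled distances: B sits at the opposite end of X1, so D is now the closer one.
scaled_ab = euclidean(scaled[A], scaled[B])
scaled_ad = euclidean(scaled[A], scaled[D])

print(raw_ab < raw_ad, scaled_ab < scaled_ad)  # True False
```

The nearest neighbor of A changes from B to D once both predictors live on the same scale, which is exactly the kind of change that affects K-nearest Neighbor and K-means.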
4. Linear Regression and feature importance
An advantage of Linear Regression is that it lets us estimate the importance of each predictor variable in the prediction. A predictor variable’s importance is represented by its corresponding weight (in other words, its coefficient).
However, if the value ranges of the predictors are not the same, e.g. X1 has a range [1, 5] while X2 is in [1000, 10000], the coefficient of X1 will tend to be larger in magnitude and the one for X2 smaller, simply because a one-unit change in X1 covers a much larger share of its range than a one-unit change in X2.
To make the coefficients a more legitimate indication of importance, all variables should be, more or less, scaled to the same range.
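The following sketch illustrates the point on made-up, noise-free data in which the response is built as y = 2*X1 + 0.002*X2. The helper `ols_two_predictors` solves the least-squares problem for exactly two predictors through the normal equations; it is written for this example and is not a general-purpose regression routine.

```python
from statistics import mean, pstdev


def ols_two_predictors(x1, x2, y):
    """Least-squares slopes (b1, b2) for y ~ b0 + b1*x1 + b2*x2."""
    m1, m2, my = mean(x1), mean(x2), mean(y)
    c1 = [v - m1 for v in x1]
    c2 = [v - m2 for v in x2]
    cy = [v - my for v in y]
    s11 = sum(a * a for a in c1)
    s22 = sum(b * b for b in c2)
    s12 = sum(a * b for a, b in zip(c1, c2))
    s1y = sum(a * t for a, t in zip(c1, cy))
    s2y = sum(b * t for b, t in zip(c2, cy))
    det = s11 * s22 - s12 * s12
    return (s22 * s1y - s12 * s2y) / det, (s11 * s2y - s12 * s1y) / det


def standardize(col):
    """Center a column on its mean and divide by its standard deviation."""
    m, s = mean(col), pstdev(col)
    return [(v - m) / s for v in col]


# Hypothetical data: X1 in [1, 5], X2 in [1000, 10000].
x1 = [1.0, 2.0, 3.0, 4.0, 5.0, 1.0, 3.0, 5.0]
x2 = [1000.0, 4000.0, 10000.0, 2000.0, 7000.0, 9000.0, 5000.0, 3000.0]
y = [2.0 * a + 0.002 * b for a, b in zip(x1, x2)]

# Raw predictors: X1 gets the larger coefficient (about 2 versus 0.002).
raw_b1, raw_b2 = ols_two_predictors(x1, x2, y)

# Standardized predictors: the order flips, X2 moves y more per standard deviation.
std_b1, std_b2 = ols_two_predictors(standardize(x1), standardize(x2), y)

print(raw_b1, raw_b2)
print(std_b1, std_b2)
```

On the raw scale X1 looks a thousand times more important, yet across its range X2 shifts y by 18 units while X1 shifts it by only 8. The standardized coefficients reflect that.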
5. Gradient Descent
Empirical experiments show that when fitting a Neural Network to data, training converges significantly faster if the data were normalized (i.e. centered and scaled, or standardized). This may seem surprising, since each input gets its own weight and, in principle, the weights could simply adapt to whatever scale a predictor has. So why does normalizing the predictors make a difference?
The answer lies in the Gradient Descent process.
Firstly, the activation functions. Broadly speaking, most activation functions work best when the input value is moderate (i.e. neither too small nor too large). For example, let’s look at the Sigmoid function (i.e. the Logistic function).
When the input value is extreme, the curve is nearly flat and the gradient is close to zero, which makes learning very slow. If instead the value is in the “effective range” (I would say, from -4 to 4), the gradient is noticeably larger, so learning is faster and the whole training time is shorter.
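As a quick check of how fast the gradient fades, here is a small sketch that evaluates the derivative of the Sigmoid function, sigmoid(z) * (1 - sigmoid(z)), at a few input values. The chosen inputs are arbitrary.

```python
import math


def sigmoid(z):
    """Logistic function 1 / (1 + e^(-z))."""
    return 1.0 / (1.0 + math.exp(-z))


def sigmoid_grad(z):
    """Derivative of the logistic function: s(z) * (1 - s(z))."""
    s = sigmoid(z)
    return s * (1.0 - s)


# The gradient peaks at 0.25 for z = 0 and shrinks quickly as |z| grows.
for z in (0.0, 2.0, 4.0, 10.0):
    print(z, sigmoid_grad(z))
```

At z = 10 the gradient is below 0.0001, thousands of times smaller than at z = 0, so an unscaled predictor with large values can push a unit into this flat region from the very first update.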
Secondly, normalization, and feature centering in particular, makes the contours of the error surface rounder, while a dataset that is not zero-centered will likely produce elongated, ellipse-like contours. With round contours, all the weights converge after a comparable number of epochs. With ellipse-shaped contours, some weights converge significantly faster than others. Furthermore, since training only converges when all the weights have settled, the slow weights keep the fast ones fluctuating and lengthen the process.
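The effect can be reproduced on a toy problem. The sketch below is a simplification, not a Neural Network: it runs plain batch gradient descent on a linear model with made-up, noise-free data, once on the raw predictors and once on the standardized ones. The step size is an assumption of this example: it is set to 1 divided by the trace of the Hessian, which is always a stable choice for this loss.

```python
from statistics import mean, pstdev


def standardize(col):
    """Center a column on its mean and divide by its standard deviation."""
    m, s = mean(col), pstdev(col)
    return [(v - m) / s for v in col]


def gd_final_loss(cols, y, steps=200):
    """Run batch gradient descent on half the mean squared error; return the final loss."""
    n = len(y)
    rows = [[1.0] + [c[i] for c in cols] for i in range(n)]  # prepend the bias input
    # Assumed step size: 1 / trace(Hessian); the trace bounds the largest eigenvalue.
    lr = 1.0 / (sum(v * v for r in rows for v in r) / n)
    w = [0.0] * len(rows[0])
    for _ in range(steps):
        errs = [sum(wj * vj for wj, vj in zip(w, r)) - t for r, t in zip(rows, y)]
        grad = [sum(e * r[j] for e, r in zip(errs, rows)) / n for j in range(len(w))]
        w = [wj - lr * g for wj, g in zip(w, grad)]
    errs = [sum(wj * vj for wj, vj in zip(w, r)) - t for r, t in zip(rows, y)]
    return sum(e * e for e in errs) / (2 * n)


# Hypothetical data: X1 in [1, 5], X2 in [1000, 10000], y = 2*X1 + 0.002*X2.
x1 = [1.0, 2.0, 3.0, 4.0, 5.0, 1.0, 3.0, 5.0]
x2 = [1000.0, 4000.0, 10000.0, 2000.0, 7000.0, 9000.0, 5000.0, 3000.0]
y = [2.0 * a + 0.002 * b for a, b in zip(x1, x2)]

raw_loss = gd_final_loss([x1, x2], y)
std_loss = gd_final_loss([standardize(x1), standardize(x2)], y)

print(raw_loss, std_loss)
```

After the same 200 updates, the standardized run has essentially reached zero loss, while the raw run is still far from it: the large X2 direction forces a tiny step size, so the weights for X1 and the bias barely move.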
In this blog post, we discussed 5 situations in which we need to center, scale or normalize our predictor variables.
In most cases, centering and scaling are safe, which means these transformations don’t affect the intrinsic meaning of the variables, and hence the predictive power of the data is not reduced. This seems to be the reason why many researchers include a normalization step by default. An exception is when we multiply 2 (or more) variables to make a new one: centering them first does change how we interpret the new variable, so please be careful.
- Geoffrey Hinton’s Neural Networks class, lecture 6: link