Feature selection is hard but very important. By focusing on the right set of features, our model is less prone to being misled and has a better chance of capturing the real patterns behind the data.
While feature extraction depends mostly on domain knowledge (which takes time and effort), feature selection goes easier on us: it can be handled quite well using standard methods (though we will probably do it even better if we do have domain knowledge).
That is why I’m writing this post: to list some well-known and efficient techniques for Feature Selection.
Sklearn: SelectKBest
First of all, let us introduce SelectKBest. It is not a scoring method itself, but a handy Sklearn class that selects exactly k features from a feature set according to a given score function. We will use it in the examples below as each scoring method is introduced.
For now, let's import some useful libraries and create two data frames: one with a numerical target and the other with a categorical target.
import numpy as np
import pandas as pd

df_reg = pd.DataFrame({'num1': [1, 1.2, 5, 8.2, 4, 3.5],
                       'num2': [6, 2, 7, 3, 9, 8],
                       'num3': [7, 0, 6, 4, 3, 2],
                       'num4': [4, 3, 1, 3, 2, 5],
                       'target': [4.5, 6.0, 9.5, 8, 8.5, 7]})

df_clf = pd.DataFrame({'num1': [1, 1.2, 5, 8.2, 4, 3.5],
                       'num2': [6, 2, 7, 3, 9, 8],
                       'num3': [7, 0, 6, 4, 3, 2],
                       'num4': [4, 3, 1, 3, 2, 5],
                       'target': ['Bad', 'Bad', 'Good', 'Average', 'Good', 'Average']})

display(df_reg)
display(df_clf)
|   | num1 | num2 | num3 | num4 | target |
|---|------|------|------|------|--------|
| 0 | 1.0  | 6    | 7    | 4    | 4.5    |
| 1 | 1.2  | 2    | 0    | 3    | 6.0    |
| 2 | 5.0  | 7    | 6    | 1    | 9.5    |
| 3 | 8.2  | 3    | 4    | 3    | 8.0    |
| 4 | 4.0  | 9    | 3    | 2    | 8.5    |
| 5 | 3.5  | 8    | 2    | 5    | 7.0    |
|   | num1 | num2 | num3 | num4 | target  |
|---|------|------|------|------|---------|
| 0 | 1.0  | 6    | 7    | 4    | Bad     |
| 1 | 1.2  | 2    | 0    | 3    | Bad     |
| 2 | 5.0  | 7    | 6    | 1    | Good    |
| 3 | 8.2  | 3    | 4    | 3    | Average |
| 4 | 4.0  | 9    | 3    | 2    | Good    |
| 5 | 3.5  | 8    | 2    | 5    | Average |
f_regression and f_classif
f_regression, as the name suggests, is used when the target variable is numerical, while f_classif is used when the target variable is categorical.
Put in the simplest way, these two methods score each predictor by its association with the target: f_regression uses the correlation between the predictor and the target (the higher the absolute correlation, the higher the score), while f_classif uses an ANOVA F-test on the predictor across the target classes.
from sklearn.feature_selection import SelectKBest, f_regression

X_reg = df_reg.drop(['target'], axis=1)
y_reg = df_reg['target']

selector = SelectKBest(f_regression, k=3)
new_X_reg = selector.fit_transform(X_reg, y_reg)

print('Predictor variables after selection:')
display(new_X_reg)
print('Score of each predictor:')
display(selector.scores_)
Predictor variables after selection:
array([[1. , 6. , 4. ],
[1.2, 2. , 3. ],
[5. , 7. , 1. ],
[8.2, 3. , 3. ],
[4. , 9. , 2. ],
[3.5, 8. , 5. ]])
Score of each predictor:
array([4.21280744e+00, 4.79454065e-01, 1.83290057e-03, 3.91540785e+00])
Initially, our data frame had 4 predictors; after selection by f_regression, only 3 remain. We can also see the score of each predictor: 'num1' has the highest score at 4.2, 'num4' follows closely with 3.9, 'num2' has a rather low score of 0.48, and 'num3' is in last place with just 0.0018.
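As a side note, fit_transform returns a plain NumPy array, so the column names are gone. If we want to know exactly which columns survived, the fitted selector can tell us through its get_support() mask; a small sketch reusing the selector and X_reg from above:
# get_support() gives a boolean mask over the original columns
selected_mask = selector.get_support()
print(X_reg.columns[selected_mask].tolist())   # ['num1', 'num2', 'num4'] for the fit above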
Ok, let's verify my explanation above, namely that the scores given by f_regression (and f_classif) follow the absolute correlation with the target variable.
df_reg.corr()['target']
num1 0.716209
num2 0.327161
num3 0.021401
num4 -0.703318
target 1.000000
Name: target, dtype: float64
That's true: the scores and the absolute correlations with the target variable follow the same pattern. The correlation between 'num1' and 'target' is 0.71, the highest; just behind it is 'num4' with -0.70 (absolute value 0.70); 'num2' is quite far behind, and 'num3' sits at the tail.
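In fact, for a single predictor the F score reported by f_regression can be recovered directly from the Pearson correlation r as F = r^2 * (n - 2) / (1 - r^2), where n is the number of samples. A quick check for 'num1':
r = df_reg.corr()['target']['num1']    # about 0.7162
n = len(df_reg)                        # 6 samples
print(r**2 * (n - 2) / (1 - r**2))     # about 4.21, matching the f_regression score above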
mutual_info_regression and mutual_info_classif
As above, there is one function for a numerical target variable and one for a categorical target variable.
Where f_regression and f_classif compute the correlation with the target, mutual_info_regression and mutual_info_classif compute the mutual information with the target, also known as information gain.
If you have some experience with decision trees, you probably already know this concept. If not, here is the short version: information gain is the difference between the target's entropy before and after it is split by a predictor.
Explaining mutual information in detail is beyond the scope of this post, so if you are curious and want to know more about it, have a look at my tutorial on the decision tree's splitting algorithms.
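Still, to make the definition a bit more concrete, here is a tiny hand-rolled sketch of information gain on df_clf's target, using a made-up binary split on 'num1' (this is only an illustration: for continuous features, mutual_info_classif actually relies on a nearest-neighbour estimator rather than a discrete split like this):
import numpy as np

def entropy(labels):
    # Shannon entropy (in bits) of an array of class labels
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -(p * np.log2(p)).sum()

y = df_clf['target'].values
split = (df_clf['num1'] > df_clf['num1'].median()).values   # made-up split: num1 above its median

h_before = entropy(y)
h_after = sum(mask.sum() / len(y) * entropy(y[mask]) for mask in (split, ~split))
print(h_before - h_after)   # information gain of this split, about 0.67 bits here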
from sklearn.feature_selection import SelectKBest, mutual_info_classif

X_clf = df_clf.drop(['target'], axis=1)
y_clf = df_clf['target']

selector = SelectKBest(mutual_info_classif, k=3)
new_X_clf = selector.fit_transform(X_clf, y_clf)

print('Predictor variables after selection:')
display(new_X_clf)
Predictor variables after selection:
array([[1. , 7. , 4. ],
[1.2, 0. , 3. ],
[5. , 6. , 1. ],
[8.2, 4. , 3. ],
[4. , 3. , 2. ],
[3.5, 2. , 5. ]])
A note about this (and probably good news): while f_regression and f_classif only capture linear relationships between predictors and the target, mutual_info_regression and mutual_info_classif can capture any kind of relationship (linear, quadratic, exponential, and so on). So, in general, mutual_info is more robust and more versatile than f_. If we have no idea what type of relationship the predictors may have with the target, mutual_info is usually preferred over f_.
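To see that difference in action, here is a small illustrative sketch (with made-up data): a purely quadratic relationship, which f_regression barely notices but mutual_info_regression picks up clearly:
import numpy as np
from sklearn.feature_selection import f_regression, mutual_info_regression

rng = np.random.RandomState(0)
x = rng.uniform(-3, 3, size=(300, 1))   # synthetic predictor
y = x[:, 0] ** 2                        # purely quadratic target, no linear trend

f_scores, p_values = f_regression(x, y)
mi_scores = mutual_info_regression(x, y, random_state=0)
print(f_scores, p_values)   # small F score, non-significant p-value
print(mi_scores)            # clearly positive mutual information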
VarianceThreshold
This is one more useful criterion for eliminating features: the variance of each variable in the training set.
Unlike the two methods above, this one takes only the predictors as input, not the response variable. It calculates the variance of each predictor and removes the ones whose variance is lower than the threshold we specify.
from sklearn.feature_selection import VarianceThreshold

X = df_reg.drop(['target'], axis=1)

selector = VarianceThreshold(threshold=3.0)
new_X = selector.fit_transform(X)
display(new_X)
array([[1. , 6. , 7. ],
[1.2, 2. , 0. ],
[5. , 7. , 6. ],
[8.2, 3. , 4. ],
[4. , 9. , 3. ],
[3.5, 8. , 2. ]])
Note that if we don't specify a threshold (i.e., leave it at its default value of 0), the selector removes only the features that take a single value throughout the training data. Leaving the threshold at its default value is also the most common choice when using VarianceThreshold.
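For example, with the default threshold a constant column is the only kind of column that gets dropped; a quick sketch reusing X from above, with a made-up constant column:
X_with_const = X.copy()
X_with_const['const'] = 1                      # made-up zero-variance column
default_selector = VarianceThreshold()         # threshold defaults to 0.0
print(default_selector.fit_transform(X_with_const).shape)   # (6, 4): only 'const' is removed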
RFE
RFE (recursive feature elimination) removes features one by one (or several at a time; the step parameter controls how many are eliminated per round) until only n_features_to_select features are left (we specify that number as well).
We supply the method used to decide which features to eliminate in the form of a supervised estimator that can compute feature importances (via coef_ or feature_importances_). For example, we can use LinearRegression (and its siblings Ridge, Lasso, and ElasticNet), LogisticRegression, RandomForest, or SVM.
Whether we use a regressor or a classifier depends on whether our response variable is numerical or categorical.
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

logistic_regression = LogisticRegression(solver='lbfgs', multi_class='auto')
selector = RFE(logistic_regression, n_features_to_select=2, step=1)

X_clf = df_clf.drop(['target'], axis=1)
y_clf = df_clf['target']

new_X_clf = selector.fit_transform(X_clf, y_clf)
display(new_X_clf)
array([[1. , 4. ],
[1.2, 3. ],
[5. , 1. ],
[8.2, 3. ],
[4. , 2. ],
[3.5, 5. ]])
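Here too, the output is a bare array. If we want to know which columns RFE kept and how the others were ranked, the fitted selector exposes support_ and ranking_; a short sketch reusing selector and X_clf from above:
print(X_clf.columns[selector.support_].tolist())    # ['num1', 'num4'] for the fit above
print(dict(zip(X_clf.columns, selector.ranking_)))  # rank 1 = kept; larger ranks were eliminated earlier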
Conclusion
Above, we have gone through some of the simplest, easiest-to-use, and most popular methods for doing feature selection with sklearn. These are the standard approaches we should always try when working with a dataset. Sometimes we try them and keep the result; sometimes we just try them to see what comes out and understand the data better.
One last note: the methods above only handle numerical predictor variables. If we want to apply them to categorical predictors, we have to convert those to numerical values first. Why and how to convert categorical variables to numerical is covered in another post.
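As a quick preview (the details really are in that other post), one common option is one-hot encoding, for example with pandas' get_dummies; a minimal sketch with made-up data:
import pandas as pd

df_cat = pd.DataFrame({'color': ['red', 'blue', 'red', 'green'],   # made-up categorical predictor
                       'num1': [1.0, 2.5, 3.0, 0.5]})
df_encoded = pd.get_dummies(df_cat, columns=['color'])
print(df_encoded.columns.tolist())   # ['num1', 'color_blue', 'color_green', 'color_red']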
Happy reading!