In Machine Learning, while some predictive models can handle categorical variables in the data directly (e.g. Naive Bayes, Decision Trees), most require all predictor variables to be numerical (e.g. Linear Regression, Logistic Regression, Neural Networks, Support Vector Machines, Linear Discriminant Analysis). Hence, it is very common that we have to transform our data from categorical values to numerical ones.
In this blog post, we will go through some of the most common methods to do so. These are:
One-hot encoding (or dummy encoding)
Map ordinal values to numbers
Map values to their statistics
Mark special categories
1. One-hot encoding (or dummy encoding)
This is a simple, non-parametric method that can be used for any kind of categorical variable, without any assumptions about its values. If our categorical feature has, for example, 5 distinct values, we split it into 5 numerical features, each corresponding to one distinct value. For each data point, exactly one of these 5 new features has the value 1, namely the one corresponding to the data point's value in the original categorical feature, while the others are all 0.
Pandas supports this transformation with the function get_dummies, which we will see below. But first, let's create a sample dataset for our experiments:
import pandas as pd

data = pd.DataFrame({
    'Language' : ['VN', 'ENG', 'DE', 'DE', 'VN', 'ENG', 'VN', 'DE'],
    'Density' : ['High', 'Medium', 'Low', 'Medium', 'Medium', 'High', 'Low', 'High'],
    'Ethnic Group' : ['Kinh', 'Dao', 'Kinh', 'Kinh', 'Kinh', 'Kinh', 'Hmong', 'Kinh'],
    'Target' : [12, 5, 3, 6, 9, 10, 6, 8]
})
print(data)
  Language Density Ethnic Group  Target
0       VN    High         Kinh      12
1      ENG  Medium          Dao       5
2       DE     Low         Kinh       3
3       DE  Medium         Kinh       6
4       VN  Medium         Kinh       9
5      ENG    High         Kinh      10
6       VN     Low        Hmong       6
7       DE    High         Kinh       8
Here we encode the column Language:
language_data = pd.get_dummies(data.Language)
print(language_data)
   DE  ENG  VN
0   0    0   1
1   0    1   0
2   1    0   0
3   1    0   0
4   0    0   1
5   0    1   0
6   0    0   1
7   1    0   0
In case we want to plug it back into our data frame:
language_data = pd.get_dummies(data.Language)
new_data = data.drop(['Language'], axis=1)
new_data = pd.concat((new_data, language_data), axis=1)
print(new_data)
  Density Ethnic Group  Target  DE  ENG  VN
0    High         Kinh      12   0    0   1
1  Medium          Dao       5   0    1   0
2     Low         Kinh       3   1    0   0
3  Medium         Kinh       6   1    0   0
4  Medium         Kinh       9   0    0   1
5    High         Kinh      10   0    1   0
6     Low        Hmong       6   0    0   1
7    High         Kinh       8   1    0   0
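As a side note, the drop-and-concat above can also be done in one step by passing the whole data frame together with the list of columns to encode; the generated columns are then prefixed with the original column name (e.g. Language_VN):

new_data = pd.get_dummies(data, columns=['Language'])
print(new_data)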
An advantage of one-hot encoding is that it makes no assumptions about the input categorical data. Thus, it serves well as a baseline method, one we can always fall back on if no other method does the job more effectively.
This method, however, widens our data considerably when the number of distinct values is large. Each unique value becomes a new column, so if we have, for instance, 1000 unique values, one-hot encoding replaces the original column with 1000 new ones, a net addition of 999 columns, which demands more memory and storage and can significantly increase the processing time of subsequent operations.
A way to resolve this issue is to combine several values into one before performing one-hot encoding. For example, since there are around 6500 languages in the world today, it would be impractical to encode each language into its own column. Instead, we can, for instance, group the languages by the number of native speakers (languages with fewer than 1000 speakers fall into group 1, languages with 1000 to 100000 speakers into group 2, and the others into group 3) or by writing system (Latin vs Greek vs Cyrillic, etc.), as sketched below. By combining values we reduce the number of columns to be created, at the risk of losing some information (yet sometimes combining values, which is a form of feature extraction, can even lead to better information).
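A minimal sketch of grouping before encoding is shown below; the writing_system mapping is purely illustrative and not part of the dataset above:

# illustrative mapping from language code to writing system
writing_system = {'VN': 'Latin', 'ENG': 'Latin', 'DE': 'Latin',
                  'RU': 'Cyrillic', 'EL': 'Greek'}

languages = pd.Series(['VN', 'ENG', 'RU', 'DE', 'EL'])
# collapse many languages into a few groups, then one-hot encode the groups
script_groups = languages.map(writing_system)
script_dummies = pd.get_dummies(script_groups, prefix='Script')
print(script_dummies)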
2. Map ordinal values to numbers
If the values of the categorical variable are ordinal, i.e. they represent some kind of intensity or order, we can create a map that assigns each value a number.
For example, if the values of a variable are Low, Medium and High, we can map them to 1, 2 and 3, respectively. This transformation is illustrated below:
density_map = {
    'Low' : 1,
    'Medium' : 2,
    'High' : 3
}
density_data = data['Density'].map(density_map)
print(density_data)
0    3
1    2
2    1
3    2
4    2
5    3
6    1
7    3
Name: Density, dtype: int64
And assign the transformed data back to our data frame:
new_data = data.copy()
new_data['Density'] = density_data
print(new_data)
  Language  Density Ethnic Group  Target
0       VN        3         Kinh      12
1      ENG        2          Dao       5
2       DE        1         Kinh       3
3       DE        2         Kinh       6
4       VN        2         Kinh       9
5      ENG        3         Kinh      10
6       VN        1        Hmong       6
7       DE        3         Kinh       8
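If we prefer not to write the mapping by hand, one possible alternative is pandas' ordered Categorical dtype, which produces equivalent codes; note that cat.codes starts at 0 rather than 1, so the result is shifted by one compared to the map above:

# declare the order of the categories explicitly
density_dtype = pd.CategoricalDtype(categories=['Low', 'Medium', 'High'], ordered=True)
# cat.codes numbers the categories 0, 1, 2 following the declared order
density_codes = data['Density'].astype(density_dtype).cat.codes
print(density_codes)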
3. Map values to their statistics
There are various ways to replace a categorical value with a statistic of its category, and our choice should be based on our knowledge of the variable being transformed.
Some common ways are:
Replace a categorical value with the mean of the target variable over that category. For example, instead of mapping the Density column's values to [1, 2, 3], we compute the mean of Target within each category as below:
print(data.groupby(['Density']).agg({'Target' : 'mean'}))
           Target
Density
High    10.000000
Low      4.500000
Medium   6.666667
And map those values back:
density_map = data.groupby(['Density']).agg({'Target' : 'mean'}).to_dict()['Target']
new_data = data.copy()
new_data['Density'] = new_data['Density'].map(density_map)
print(new_data)
  Language    Density Ethnic Group  Target
0       VN  10.000000         Kinh      12
1      ENG   6.666667          Dao       5
2       DE   4.500000         Kinh       3
3       DE   6.666667         Kinh       6
4       VN   6.666667         Kinh       9
5      ENG  10.000000         Kinh      10
6       VN   4.500000        Hmong       6
7       DE  10.000000         Kinh       8
Replace a categorical value with the count (or proportion) of its category.
density_type_count = data.groupby(['Density']).size().to_dict()
print(density_type_count)
{'High': 3, 'Low': 2, 'Medium': 3}
And then:
new_data = data.copy()
new_data['Density Type Count'] = new_data['Density'].map(density_type_count)
new_data = new_data.drop(['Density'], axis=1)
print(new_data)
  Language Ethnic Group  Target  Density Type Count
0       VN         Kinh      12                   3
1      ENG          Dao       5                   3
2       DE         Kinh       3                   2
3       DE         Kinh       6                   3
4       VN         Kinh       9                   3
5      ENG         Kinh      10                   3
6       VN        Hmong       6                   2
7       DE         Kinh       8                   3
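For the proportion variant mentioned above, a small sketch: divide each category's size by the total number of rows before mapping it back.

# proportion of rows falling into each Density category
density_type_prop = (data.groupby(['Density']).size() / len(data)).to_dict()

new_data = data.copy()
new_data['Density Type Proportion'] = new_data['Density'].map(density_type_prop)
new_data = new_data.drop(['Density'], axis=1)
print(new_data)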
4. Mark special categories
Sometimes, within a categorical variable, some values are more special than the others (this distinction may be intrinsic or statistical). For instance, in the predictor variable Ethnic Group, there are minority and majority groups. The minority groups usually have very different living conditions, jobs, and even perspectives from the majority ones. Hence, in many cases, we should make a distinction between them as a form of feature extraction. Let's create a new column whose value is 0 if the ethnic group is a minority and 1 otherwise.
# define a list of all majority groups
majority_groups = ['Kinh']

# transforming data
new_data = data.copy()
new_data['Is Majority'] = new_data['Ethnic Group'].apply(lambda x : int(x in majority_groups))
print(new_data)
  Language Density Ethnic Group  Target  Is Majority
0       VN    High         Kinh      12            1
1      ENG  Medium          Dao       5            0
2       DE     Low         Kinh       3            1
3       DE  Medium         Kinh       6            1
4       VN  Medium         Kinh       9            1
5      ENG    High         Kinh      10            1
6       VN     Low        Hmong       6            0
7       DE    High         Kinh       8            1
The original Ethnic Group column can then either be discarded (because the new Is Majority column has extracted some information from it) or kept and further transformed by other operations.