In Machine Learning, while some predictive models can handle categorical variables in the data directly (e.g. Naive Bayes, Decision Trees), most require all predictor variables to be numerical (e.g. Linear Regression, Logistic Regression, Neural Networks, Support Vector Machines, Linear Discriminant Analysis). Hence, it is very common that we have to transform our data from categorical values to numerical ones.
In this blog post, we will go through some of the most common methods to do so. These are:
One-hot encoding (or dummy encoding)
Map ordinal values to numbers
Map values to their statistics
Mark special categories
1. One-hot encoding (or dummy encoding)
This is a simple, non-parametric method that can be used for any kind of categorical variable, without any assumptions about its values. If our categorical feature has, for example, 5 distinct values, we split it into 5 numerical features, each corresponding to one distinct value. For each data point, exactly one of these 5 new features has the value 1, namely the one corresponding to the data point's value in the original categorical feature, while the others are all 0.
Pandas supports this transformation with the function get_dummies, which we will see below. But first, let's create a sample dataset for our experiments:
import pandas as pd

data = pd.DataFrame({
    'Language' : ['VN', 'ENG', 'DE', 'DE', 'VN', 'ENG', 'VN', 'DE'],
    'Density' : ['High', 'Medium', 'Low', 'Medium', 'Medium', 'High', 'Low', 'High'],
    'Ethnic Group' : ['Kinh', 'Dao', 'Kinh', 'Kinh', 'Kinh', 'Kinh', 'Hmong', 'Kinh'],
    'Target' : [12, 5, 3, 6, 9, 10, 6, 8]
})
print(data)
  Language Density Ethnic Group  Target
0       VN    High         Kinh      12
1      ENG  Medium          Dao       5
2       DE     Low         Kinh       3
3       DE  Medium         Kinh       6
4       VN  Medium         Kinh       9
5      ENG    High         Kinh      10
6       VN     Low        Hmong       6
7       DE    High         Kinh       8
Here we encode the column Language:
language_data = pd.get_dummies(data.Language)
print(language_data)
   DE  ENG  VN
0   0    0   1
1   0    1   0
2   1    0   0
3   1    0   0
4   0    0   1
5   0    1   0
6   0    0   1
7   1    0   0
In case we want to plug it back into our data frame:
language_data = pd.get_dummies(data.Language)
new_data = data.drop(['Language'], axis=1)
new_data = pd.concat((new_data, language_data), axis=1)
print(new_data)
  Density Ethnic Group  Target  DE  ENG  VN
0    High         Kinh      12   0    0   1
1  Medium          Dao       5   0    1   0
2     Low         Kinh       3   1    0   0
3  Medium         Kinh       6   1    0   0
4  Medium         Kinh       9   0    0   1
5    High         Kinh      10   0    1   0
6     Low        Hmong       6   0    0   1
7    High         Kinh       8   1    0   0
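As a side note, the drop-and-concat above can also be done in one step by passing the whole data frame together with the list of columns to encode; the generated columns are then prefixed with the original column name (e.g. Language_VN):

new_data = pd.get_dummies(data, columns=['Language'])
print(new_data)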
An advantage of one-hot encoding is that it makes no assumptions about the input categorical data. Thus, it serves well as a baseline method, one we can always fall back on if no other method does the job more effectively.
This method, however, widens our data considerably when the number of distinct values is large. Each unique value becomes a new column, so if we have, for instance, 1000 unique values, one-hot encoding replaces the original column with 1000 new ones, a net addition of 999 columns, which demands more memory and storage and can significantly increase the processing time of subsequent operations.
A way to resolve this issue is to combine several values into one before performing one-hot encoding. For example, since there are around 6500 languages in the world today, it would be impractical to encode each language into its own column. Instead, we can, for instance, group the languages by the number of native speakers (languages with fewer than 1000 speakers fall into group 1, languages with 1000 to 100000 speakers into group 2, and the others into group 3) or by writing system (Latin vs Greek vs Cyrillic, etc.), as sketched below. By combining values we reduce the number of columns to be created, at the risk of losing some information (yet sometimes combining values, which is a form of feature extraction, can even lead to better information).
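A minimal sketch of grouping before encoding is shown below; the writing_system mapping is purely illustrative and not part of the dataset above:

# illustrative mapping from language code to writing system
writing_system = {'VN': 'Latin', 'ENG': 'Latin', 'DE': 'Latin',
                  'RU': 'Cyrillic', 'EL': 'Greek'}

languages = pd.Series(['VN', 'ENG', 'RU', 'DE', 'EL'])
# collapse many languages into a few groups, then one-hot encode the groups
script_groups = languages.map(writing_system)
script_dummies = pd.get_dummies(script_groups, prefix='Script')
print(script_dummies)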
2. Map ordinal values to numbers
If the values of the categorical variable are ordinal, i.e. they represent some kind of intensity or order, we can create a map that assigns each value a number.
For example, if the values of a variable are Low, Medium and High, we can map them to 1, 2 and 3, respectively. This transformation is illustrated below:
density_map = {
    'Low' : 1,
    'Medium' : 2,
    'High' : 3
}
density_data = data['Density'].map(density_map)
print(density_data)
0    3
1    2
2    1
3    2
4    2
5    3
6    1
7    3
Name: Density, dtype: int64
And assign the transformed data back to our data frame:
new_data = data.copy()
new_data['Density'] = density_data
print(new_data)
  Language  Density Ethnic Group  Target
0       VN        3         Kinh      12
1      ENG        2          Dao       5
2       DE        1         Kinh       3
3       DE        2         Kinh       6
4       VN        2         Kinh       9
5      ENG        3         Kinh      10
6       VN        1        Hmong       6
7       DE        3         Kinh       8
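If we prefer not to write the mapping by hand, one possible alternative is pandas' ordered Categorical dtype, which produces equivalent codes; note that cat.codes starts at 0 rather than 1, so the result is shifted by one compared to the map above:

# declare the order of the categories explicitly
density_dtype = pd.CategoricalDtype(categories=['Low', 'Medium', 'High'], ordered=True)
# cat.codes numbers the categories 0, 1, 2 following the declared order
density_codes = data['Density'].astype(density_dtype).cat.codes
print(density_codes)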
3. Map values to their statistics
There are various ways to replace a categorical value with a statistic of its category, and our choice should be based on our knowledge of the variable being transformed.
Some common ways are:
Replace a categorical value with the mean of the target variable over that category. For example, instead of mapping the Density column's values to [1, 2, 3], we compute the mean of Target within each category as below:
print(data.groupby(['Density']).agg({'Target' : 'mean'}))
           Target
Density
High    10.000000
Low      4.500000
Medium   6.666667
And map those values back:
density_map = data.groupby(['Density']).agg({'Target' : 'mean'}).to_dict()['Target']
new_data = data.copy()
new_data['Density'] = new_data['Density'].map(density_map)
print(new_data)
  Language    Density Ethnic Group  Target
0       VN  10.000000         Kinh      12
1      ENG   6.666667          Dao       5
2       DE   4.500000         Kinh       3
3       DE   6.666667         Kinh       6
4       VN   6.666667         Kinh       9
5      ENG  10.000000         Kinh      10
6       VN   4.500000        Hmong       6
7       DE  10.000000         Kinh       8
Replace a categorical value with the count (or proportion) of its category.
density_type_count = data.groupby(['Density']).size().to_dict()
print(density_type_count)
{'High': 3, 'Low': 2, 'Medium': 3}
And then:
new_data = data.copy()
new_data['Density Type Count'] = new_data['Density'].map(density_type_count)
new_data = new_data.drop(['Density'], axis=1)
print(new_data)
  Language Ethnic Group  Target  Density Type Count
0       VN         Kinh      12                   3
1      ENG          Dao       5                   3
2       DE         Kinh       3                   2
3       DE         Kinh       6                   3
4       VN         Kinh       9                   3
5      ENG         Kinh      10                   3
6       VN        Hmong       6                   2
7       DE         Kinh       8                   3
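For the proportion variant mentioned above, a small sketch: divide each category's size by the total number of rows before mapping it back.

# proportion of rows falling into each Density category
density_type_prop = (data.groupby(['Density']).size() / len(data)).to_dict()

new_data = data.copy()
new_data['Density Type Proportion'] = new_data['Density'].map(density_type_prop)
new_data = new_data.drop(['Density'], axis=1)
print(new_data)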
4. Mark special categories
Sometimes, within a categorical variable, some values are more special than the others (this distinction may be intrinsic or statistical). For instance, in the predictor variable Ethnic Group, there are minority and majority groups. The minority groups usually have very different living conditions, jobs, and even perspectives from the majority ones. Hence, in many cases, we should make a distinction between them as a form of feature extraction. Let's create a new column whose value is 0 if the ethnic group is a minority and 1 otherwise.
# define a list of all majority groups
majority_groups = ['Kinh']

# transforming data
new_data = data.copy()
new_data['Is Majority'] = new_data['Ethnic Group'].apply(lambda x : int(x in majority_groups))
print(new_data)
  Language Density Ethnic Group  Target  Is Majority
0       VN    High         Kinh      12            1
1      ENG  Medium          Dao       5            0
2       DE     Low         Kinh       3            1
3       DE  Medium         Kinh       6            1
4       VN  Medium         Kinh       9            1
5      ENG    High         Kinh      10            1
6       VN     Low        Hmong       6            0
7       DE    High         Kinh       8            1
The original Ethnic Group column can then either be discarded (because the new Is Majority column has extracted some information from it) or kept and further transformed by other operations.