How to deal with missing values (NaNs)

Test your knowledge

This blog post attempts to address why NaNs are bad and how we can fix them.

The content is structured as below:

Reasons to handle NaNs
Methods to handle NaNs
Fill-in with 0s
Fill-in with mean, median or mode
Fill NaN with different values for each column
Create a new value representing NaN
Back-fill and Forward-fill
Delete the whole row or column

Why do we have to handle NaNs?

NaNs often mean Unknown, which is what we don’t like. If we want to extract information from data, we have to know the data, not “unknown”.
NaNs can misdirect us from the truth.
Some (or maybe I should say Most) of our predictive models do not work when there are NaN values in the input.

Methods to handle NaNs

First, let’s import some useful libraries.

import numpy as np
import pandas as pd

And create a toy data frame.

df = pd.DataFrame({'num1' : [1, 1.2, 5, np.nan, 4, 3.5, 4], 
                   'num2' : [np.nan, 2, 7, 3, 9, 8, 2],
                   'num3' : [7, 0, 6, np.nan, 3, 2, np.nan],
                   'cat1' : ['pen', 'book', 'pencil', 'book', np.nan, 'book', 'pen'],
                   'cat2' : [np.nan, np.nan, np.nan, 'google', np.nan, np.nan, np.nan],
                   'target' : [4.5, 6.0, 9.5, 8, 8.5, 7, np.nan]
                  })
print(df)

   num1  num2  num3    cat1    cat2  target
0   1.0   NaN   7.0     pen     NaN     4.5
1   1.2   2.0   0.0    book     NaN     6.0
2   5.0   7.0   6.0  pencil     NaN     9.5
3   NaN   3.0   NaN    book  google     8.0
4   4.0   9.0   3.0     NaN     NaN     8.5
5   3.5   8.0   2.0    book     NaN     7.0
6   4.0   2.0   NaN     pen     NaN     NaN

Suppose we are working on a regression project.

Our data frame is named df.
We have 5 predictor variables. Amongst them, 3 are numerical: from num1 to num3, the remaining 2 are categorial: cat1 and cat2.
The response variable is named target and is numerical.

To have an overview of our NaN status:

# NaN count on columns
print('The number of Nan values on each column:')
print(df.isna().sum().sort_values(ascending=False))

# NaN count on rows
print('The number of Nan values on each row:')
print(df.isna().sum(axis=1).sort_values(ascending=False))

The number of Nan values on each column:
cat2      6
num3      2
target    1
cat1      1
num2      1
num1      1
dtype: int64
The number of Nan values on each row:
6    3
4    2
3    2
0    2
5    1
2    1
1    1
dtype: int64

Okay, now we will just hop onto some methods to deal with these NaN values.

Fill-in with 0s

In many cases, NaN values just mean 0. Remember to read your data description carefully to ensure this assumption.

Note that filling with 0 only works for numerical variables.

new_df = df.copy()
new_df[['num1', 'num2', 'num3']] = new_df[['num1', 'num2', 'num3']].fillna(0)
print(new_df)

   num1  num2  num3    cat1    cat2  target
0   1.0   0.0   7.0     pen     NaN     4.5
1   1.2   2.0   0.0    book     NaN     6.0
2   5.0   7.0   6.0  pencil     NaN     9.5
3   0.0   3.0   0.0    book  google     8.0
4   4.0   9.0   3.0     NaN     NaN     8.5
5   3.5   8.0   2.0    book     NaN     7.0
6   4.0   2.0   0.0     pen     NaN     NaN

Fill-in with mean, median or mode

This is one of the most popular ways that many researchers prefer, because:

It is simple, hence not prone to overfitting.
It works well if we don’t have much information about the data.
It normally provides competent results.

Note that mean and median are only used for numerical variables, while the mode is usually used for categorical variables.

# fill every NaN with the mean of non-NaN values on its column.
# Note: this works only for numerical columns
new_df = df.fillna(df.mean())
print(new_df)

       num1      num2  num3    cat1    cat2  target
0  1.000000  5.166667   7.0     pen     NaN    4.50
1  1.200000  2.000000   0.0    book     NaN    6.00
2  5.000000  7.000000   6.0  pencil     NaN    9.50
3  3.116667  3.000000   3.6    book  google    8.00
4  4.000000  9.000000   3.0     NaN     NaN    8.50
5  3.500000  8.000000   2.0    book     NaN    7.00
6  4.000000  2.000000   3.6     pen     NaN    7.25

# fill NaN by median, but for only specified column.
new_df = df.copy()
new_df['num1'] = new_df['num1'].fillna(new_df['num1'].median())
print(new_df)

   num1  num2  num3    cat1    cat2  target
0  1.00   NaN   7.0     pen     NaN     4.5
1  1.20   2.0   0.0    book     NaN     6.0
2  5.00   7.0   6.0  pencil     NaN     9.5
3  3.75   3.0   NaN    book  google     8.0
4  4.00   9.0   3.0     NaN     NaN     8.5
5  3.50   8.0   2.0    book     NaN     7.0
6  4.00   2.0   NaN     pen     NaN     NaN

# fill by mode
# Note: This method is normally used with categorical varibles only.
new_df = df.copy()
new_df['cat1'] = new_df['cat1'].fillna(new_df['cat1'].mode()[0])
print(new_df)

   num1  num2  num3    cat1    cat2  target
0   1.0   NaN   7.0     pen     NaN     4.5
1   1.2   2.0   0.0    book     NaN     6.0
2   5.0   7.0   6.0  pencil     NaN     9.5
3   NaN   3.0   NaN    book  google     8.0
4   4.0   9.0   3.0    book     NaN     8.5
5   3.5   8.0   2.0    book     NaN     7.0
6   4.0   2.0   NaN     pen     NaN     NaN

Fill NaN with different values for each column

We may create a dictionary to map each column to a default value.

map_values = {'num1' : 3, 'num2' : 6, 'num3' : 0, 
              'cat1' : 'notebook', 'cat2' : 'microsoft'}
new_df = df.fillna(map_values)
print(new_df)

   num1  num2  num3      cat1       cat2  target
0   1.0   6.0   7.0       pen  microsoft     4.5
1   1.2   2.0   0.0      book  microsoft     6.0
2   5.0   7.0   6.0    pencil  microsoft     9.5
3   3.0   3.0   0.0      book     google     8.0
4   4.0   9.0   3.0  notebook  microsoft     8.5
5   3.5   8.0   2.0      book  microsoft     7.0
6   4.0   2.0   0.0       pen  microsoft     NaN

Create a new value representing NaN (categorical variables)

If the column’s values are categorical, we can just assign NaN to a new value that does not coincide with any others in the column.

new_df = df.copy()
new_df['cat1'] = new_df['cat1'].fillna('no_info')
print(new_df)

   num1  num2  num3     cat1    cat2  target
0   1.0   NaN   7.0      pen     NaN     4.5
1   1.2   2.0   0.0     book     NaN     6.0
2   5.0   7.0   6.0   pencil     NaN     9.5
3   NaN   3.0   NaN     book  google     8.0
4   4.0   9.0   3.0  no_info     NaN     8.5
5   3.5   8.0   2.0     book     NaN     7.0
6   4.0   2.0   NaN      pen     NaN     NaN

Back-fill and Forward-fill

This is often used for time-series data.

Back-fill means taking the value in the former row to fill-in the prior rows, while Forward-fill is the opposite.

# Back-fill
new_df = df.fillna(method='bfill')
print(new_df)

   num1  num2  num3    cat1    cat2  target
0   1.0   2.0   7.0     pen  google     4.5
1   1.2   2.0   0.0    book  google     6.0
2   5.0   7.0   6.0  pencil  google     9.5
3   4.0   3.0   3.0    book  google     8.0
4   4.0   9.0   3.0    book     NaN     8.5
5   3.5   8.0   2.0    book     NaN     7.0
6   4.0   2.0   NaN     pen     NaN     NaN

# Forward-fill
new_df = df.fillna(method='ffill')
print(new_df)

   num1  num2  num3    cat1    cat2  target
0   1.0   NaN   7.0     pen     NaN     4.5
1   1.2   2.0   0.0    book     NaN     6.0
2   5.0   7.0   6.0  pencil     NaN     9.5
3   5.0   3.0   6.0    book  google     8.0
4   4.0   9.0   3.0    book  google     8.5
5   3.5   8.0   2.0    book  google     7.0
6   4.0   2.0   2.0     pen  google     7.0

Delete the whole row or column

Be very careful when applying this method. When you remove anything from your data, you can potentially be missing some useful information that cannot be compensated.

However, cases by cases, sometimes you need to delete some rows or columns to not let them ruin the good value of your data. For example: when you have thousands of samples with perfect values (no NaN), but only 2 samples that have NaN on almost all columns, and you further find out that these NaN are produced by an error in your sensors, then you should delete these 2 rows. Another case you can consider is when your response variable contains several NaN, if removing these samples does not affect your data distribution, you should probably delete them.

# We delete column 'cat2' because it contains 6 NaNs over 7 samples.
new_df = df.drop(['cat2'], axis=1)
print(new_df)

   num1  num2  num3    cat1  target
0   1.0   NaN   7.0     pen     4.5
1   1.2   2.0   0.0    book     6.0
2   5.0   7.0   6.0  pencil     9.5
3   NaN   3.0   NaN    book     8.0
4   4.0   9.0   3.0     NaN     8.5
5   3.5   8.0   2.0    book     7.0
6   4.0   2.0   NaN     pen     NaN

# We delete the last row because its response value is NaN
new_df = df[df.target.notna()]
print(new_df)

   num1  num2  num3    cat1    cat2  target
0   1.0   NaN   7.0     pen     NaN     4.5
1   1.2   2.0   0.0    book     NaN     6.0
2   5.0   7.0   6.0  pencil     NaN     9.5
3   NaN   3.0   NaN    book  google     8.0
4   4.0   9.0   3.0     NaN     NaN     8.5
5   3.5   8.0   2.0    book     NaN     7.0

Test your understanding

Tung M Phung's Blog

How to deal with missing values (NaNs)

Leave a ReplyCancel reply