Test your knowledge |
|
This blog post attempts to address why NaNs are bad and how we can fix them.
The content is structured as below:
Reasons to handle NaNs
Methods to handle NaNs
  Fill-in with 0s
  Fill-in with mean, median or mode
  Fill NaN with different values for each column
  Create a new value representing NaN
  Back-fill and Forward-fill
  Delete the whole row or column
Why do we have to handle NaNs?
Methods to handle NaNs
First, let’s import some useful libraries.
import numpy as np import pandas as pd
And create a toy data frame.
df = pd.DataFrame({'num1' : [1, 1.2, 5, np.nan, 4, 3.5, 4], 'num2' : [np.nan, 2, 7, 3, 9, 8, 2], 'num3' : [7, 0, 6, np.nan, 3, 2, np.nan], 'cat1' : ['pen', 'book', 'pencil', 'book', np.nan, 'book', 'pen'], 'cat2' : [np.nan, np.nan, np.nan, 'google', np.nan, np.nan, np.nan], 'target' : [4.5, 6.0, 9.5, 8, 8.5, 7, np.nan] }) print(df)
num1 num2 num3 cat1 cat2 target
0 1.0 NaN 7.0 pen NaN 4.5
1 1.2 2.0 0.0 book NaN 6.0
2 5.0 7.0 6.0 pencil NaN 9.5
3 NaN 3.0 NaN book google 8.0
4 4.0 9.0 3.0 NaN NaN 8.5
5 3.5 8.0 2.0 book NaN 7.0
6 4.0 2.0 NaN pen NaN NaN
Suppose we are working on a regression project.
To have an overview of our NaN status:
# NaN count on columns print('The number of Nan values on each column:') print(df.isna().sum().sort_values(ascending=False)) # NaN count on rows print('The number of Nan values on each row:') print(df.isna().sum(axis=1).sort_values(ascending=False))
The number of Nan values on each column:
cat2 6
num3 2
target 1
cat1 1
num2 1
num1 1
dtype: int64
The number of Nan values on each row:
6 3
4 2
3 2
0 2
5 1
2 1
1 1
dtype: int64
Okay, now we will just hop onto some methods to deal with these NaN values.
In many cases, NaN values just mean 0. Remember to read your data description carefully to ensure this assumption.
Note that filling with 0 only works for numerical variables.
new_df = df.copy() new_df[['num1', 'num2', 'num3']] = new_df[['num1', 'num2', 'num3']].fillna(0) print(new_df)
num1 num2 num3 cat1 cat2 target
0 1.0 0.0 7.0 pen NaN 4.5
1 1.2 2.0 0.0 book NaN 6.0
2 5.0 7.0 6.0 pencil NaN 9.5
3 0.0 3.0 0.0 book google 8.0
4 4.0 9.0 3.0 NaN NaN 8.5
5 3.5 8.0 2.0 book NaN 7.0
6 4.0 2.0 0.0 pen NaN NaN
This is one of the most popular ways that many researchers prefer, because:
Note that mean and median are only used for numerical variables, while the mode is usually used for categorical variables.
# fill every NaN with the mean of non-NaN values on its column. # Note: this works only for numerical columns new_df = df.fillna(df.mean()) print(new_df)
num1 num2 num3 cat1 cat2 target
0 1.000000 5.166667 7.0 pen NaN 4.50
1 1.200000 2.000000 0.0 book NaN 6.00
2 5.000000 7.000000 6.0 pencil NaN 9.50
3 3.116667 3.000000 3.6 book google 8.00
4 4.000000 9.000000 3.0 NaN NaN 8.50
5 3.500000 8.000000 2.0 book NaN 7.00
6 4.000000 2.000000 3.6 pen NaN 7.25
# fill NaN by median, but for only specified column. new_df = df.copy() new_df['num1'] = new_df['num1'].fillna(new_df['num1'].median()) print(new_df)
num1 num2 num3 cat1 cat2 target
0 1.00 NaN 7.0 pen NaN 4.5
1 1.20 2.0 0.0 book NaN 6.0
2 5.00 7.0 6.0 pencil NaN 9.5
3 3.75 3.0 NaN book google 8.0
4 4.00 9.0 3.0 NaN NaN 8.5
5 3.50 8.0 2.0 book NaN 7.0
6 4.00 2.0 NaN pen NaN NaN
# fill by mode # Note: This method is normally used with categorical varibles only. new_df = df.copy() new_df['cat1'] = new_df['cat1'].fillna(new_df['cat1'].mode()[0]) print(new_df)
num1 num2 num3 cat1 cat2 target
0 1.0 NaN 7.0 pen NaN 4.5
1 1.2 2.0 0.0 book NaN 6.0
2 5.0 7.0 6.0 pencil NaN 9.5
3 NaN 3.0 NaN book google 8.0
4 4.0 9.0 3.0 book NaN 8.5
5 3.5 8.0 2.0 book NaN 7.0
6 4.0 2.0 NaN pen NaN NaN
We may create a dictionary to map each column to a default value.
map_values = {'num1' : 3, 'num2' : 6, 'num3' : 0, 'cat1' : 'notebook', 'cat2' : 'microsoft'} new_df = df.fillna(map_values) print(new_df)
num1 num2 num3 cat1 cat2 target
0 1.0 6.0 7.0 pen microsoft 4.5
1 1.2 2.0 0.0 book microsoft 6.0
2 5.0 7.0 6.0 pencil microsoft 9.5
3 3.0 3.0 0.0 book google 8.0
4 4.0 9.0 3.0 notebook microsoft 8.5
5 3.5 8.0 2.0 book microsoft 7.0
6 4.0 2.0 0.0 pen microsoft NaN
If the column’s values are categorical, we can just assign NaN to a new value that does not coincide with any others in the column.
new_df = df.copy() new_df['cat1'] = new_df['cat1'].fillna('no_info') print(new_df)
num1 num2 num3 cat1 cat2 target
0 1.0 NaN 7.0 pen NaN 4.5
1 1.2 2.0 0.0 book NaN 6.0
2 5.0 7.0 6.0 pencil NaN 9.5
3 NaN 3.0 NaN book google 8.0
4 4.0 9.0 3.0 no_info NaN 8.5
5 3.5 8.0 2.0 book NaN 7.0
6 4.0 2.0 NaN pen NaN NaN
This is often used for time-series data.
Back-fill means taking the value in the former row to fill-in the prior rows, while Forward-fill is the opposite.
# Back-fill new_df = df.fillna(method='bfill') print(new_df)
num1 num2 num3 cat1 cat2 target
0 1.0 2.0 7.0 pen google 4.5
1 1.2 2.0 0.0 book google 6.0
2 5.0 7.0 6.0 pencil google 9.5
3 4.0 3.0 3.0 book google 8.0
4 4.0 9.0 3.0 book NaN 8.5
5 3.5 8.0 2.0 book NaN 7.0
6 4.0 2.0 NaN pen NaN NaN
# Forward-fill new_df = df.fillna(method='ffill') print(new_df)
num1 num2 num3 cat1 cat2 target
0 1.0 NaN 7.0 pen NaN 4.5
1 1.2 2.0 0.0 book NaN 6.0
2 5.0 7.0 6.0 pencil NaN 9.5
3 5.0 3.0 6.0 book google 8.0
4 4.0 9.0 3.0 book google 8.5
5 3.5 8.0 2.0 book google 7.0
6 4.0 2.0 2.0 pen google 7.0
Be very careful when applying this method. When you remove anything from your data, you can potentially be missing some useful information that cannot be compensated.
However, cases by cases, sometimes you need to delete some rows or columns to not let them ruin the good value of your data. For example: when you have thousands of samples with perfect values (no NaN), but only 2 samples that have NaN on almost all columns, and you further find out that these NaN are produced by an error in your sensors, then you should delete these 2 rows. Another case you can consider is when your response variable contains several NaN, if removing these samples does not affect your data distribution, you should probably delete them.
# We delete column 'cat2' because it contains 6 NaNs over 7 samples. new_df = df.drop(['cat2'], axis=1) print(new_df)
num1 num2 num3 cat1 target
0 1.0 NaN 7.0 pen 4.5
1 1.2 2.0 0.0 book 6.0
2 5.0 7.0 6.0 pencil 9.5
3 NaN 3.0 NaN book 8.0
4 4.0 9.0 3.0 NaN 8.5
5 3.5 8.0 2.0 book 7.0
6 4.0 2.0 NaN pen NaN
# We delete the last row because its response value is NaN new_df = df[df.target.notna()] print(new_df)
num1 num2 num3 cat1 cat2 target
0 1.0 NaN 7.0 pen NaN 4.5
1 1.2 2.0 0.0 book NaN 6.0
2 5.0 7.0 6.0 pencil NaN 9.5
3 NaN 3.0 NaN book google 8.0
4 4.0 9.0 3.0 NaN NaN 8.5
5 3.5 8.0 2.0 book NaN 7.0
Test your understanding |
|