
| Test your knowledge | 
|  | 
This blog post attempts to address why NaNs are bad and how we can fix them.
The content is structured as below:
Reasons to handle NaNs
Methods to handle NaNs
  Fill-in with 0s
  Fill-in with mean, median or mode
  Fill NaN with different values for each column
  Create a new value representing NaN
  Back-fill and Forward-fill
  Delete the whole row or column
Why do we have to handle NaNs?
Methods to handle NaNs
First, let’s import some useful libraries.
import numpy as np import pandas as pd
And create a toy data frame.
df = pd.DataFrame({'num1' : [1, 1.2, 5, np.nan, 4, 3.5, 4], 
                   'num2' : [np.nan, 2, 7, 3, 9, 8, 2],
                   'num3' : [7, 0, 6, np.nan, 3, 2, np.nan],
                   'cat1' : ['pen', 'book', 'pencil', 'book', np.nan, 'book', 'pen'],
                   'cat2' : [np.nan, np.nan, np.nan, 'google', np.nan, np.nan, np.nan],
                   'target' : [4.5, 6.0, 9.5, 8, 8.5, 7, np.nan]
                  })
print(df)
   num1  num2  num3    cat1    cat2  target
0   1.0   NaN   7.0     pen     NaN     4.5
1   1.2   2.0   0.0    book     NaN     6.0
2   5.0   7.0   6.0  pencil     NaN     9.5
3   NaN   3.0   NaN    book  google     8.0
4   4.0   9.0   3.0     NaN     NaN     8.5
5   3.5   8.0   2.0    book     NaN     7.0
6   4.0   2.0   NaN     pen     NaN     NaNSuppose we are working on a regression project.
To have an overview of our NaN status:
# NaN count on columns
print('The number of Nan values on each column:')
print(df.isna().sum().sort_values(ascending=False))
# NaN count on rows
print('The number of Nan values on each row:')
print(df.isna().sum(axis=1).sort_values(ascending=False))
The number of Nan values on each column:
cat2      6
num3      2
target    1
cat1      1
num2      1
num1      1
dtype: int64
The number of Nan values on each row:
6    3
4    2
3    2
0    2
5    1
2    1
1    1
dtype: int64Okay, now we will just hop onto some methods to deal with these NaN values.
In many cases, NaN values just mean 0. Remember to read your data description carefully to ensure this assumption.
Note that filling with 0 only works for numerical variables.
new_df = df.copy() new_df[['num1', 'num2', 'num3']] = new_df[['num1', 'num2', 'num3']].fillna(0) print(new_df)
   num1  num2  num3    cat1    cat2  target
0   1.0   0.0   7.0     pen     NaN     4.5
1   1.2   2.0   0.0    book     NaN     6.0
2   5.0   7.0   6.0  pencil     NaN     9.5
3   0.0   3.0   0.0    book  google     8.0
4   4.0   9.0   3.0     NaN     NaN     8.5
5   3.5   8.0   2.0    book     NaN     7.0
6   4.0   2.0   0.0     pen     NaN     NaNThis is one of the most popular ways that many researchers prefer, because:
Note that mean and median are only used for numerical variables, while the mode is usually used for categorical variables.
# fill every NaN with the mean of non-NaN values on its column. # Note: this works only for numerical columns new_df = df.fillna(df.mean()) print(new_df)
       num1      num2  num3    cat1    cat2  target
0  1.000000  5.166667   7.0     pen     NaN    4.50
1  1.200000  2.000000   0.0    book     NaN    6.00
2  5.000000  7.000000   6.0  pencil     NaN    9.50
3  3.116667  3.000000   3.6    book  google    8.00
4  4.000000  9.000000   3.0     NaN     NaN    8.50
5  3.500000  8.000000   2.0    book     NaN    7.00
6  4.000000  2.000000   3.6     pen     NaN    7.25# fill NaN by median, but for only specified column. new_df = df.copy() new_df['num1'] = new_df['num1'].fillna(new_df['num1'].median()) print(new_df)
   num1  num2  num3    cat1    cat2  target
0  1.00   NaN   7.0     pen     NaN     4.5
1  1.20   2.0   0.0    book     NaN     6.0
2  5.00   7.0   6.0  pencil     NaN     9.5
3  3.75   3.0   NaN    book  google     8.0
4  4.00   9.0   3.0     NaN     NaN     8.5
5  3.50   8.0   2.0    book     NaN     7.0
6  4.00   2.0   NaN     pen     NaN     NaN# fill by mode # Note: This method is normally used with categorical varibles only. new_df = df.copy() new_df['cat1'] = new_df['cat1'].fillna(new_df['cat1'].mode()[0]) print(new_df)
   num1  num2  num3    cat1    cat2  target
0   1.0   NaN   7.0     pen     NaN     4.5
1   1.2   2.0   0.0    book     NaN     6.0
2   5.0   7.0   6.0  pencil     NaN     9.5
3   NaN   3.0   NaN    book  google     8.0
4   4.0   9.0   3.0    book     NaN     8.5
5   3.5   8.0   2.0    book     NaN     7.0
6   4.0   2.0   NaN     pen     NaN     NaNWe may create a dictionary to map each column to a default value.
map_values = {'num1' : 3, 'num2' : 6, 'num3' : 0, 
              'cat1' : 'notebook', 'cat2' : 'microsoft'}
new_df = df.fillna(map_values)
print(new_df)
   num1  num2  num3      cat1       cat2  target
0   1.0   6.0   7.0       pen  microsoft     4.5
1   1.2   2.0   0.0      book  microsoft     6.0
2   5.0   7.0   6.0    pencil  microsoft     9.5
3   3.0   3.0   0.0      book     google     8.0
4   4.0   9.0   3.0  notebook  microsoft     8.5
5   3.5   8.0   2.0      book  microsoft     7.0
6   4.0   2.0   0.0       pen  microsoft     NaNIf the column’s values are categorical, we can just assign NaN to a new value that does not coincide with any others in the column.
new_df = df.copy()
new_df['cat1'] = new_df['cat1'].fillna('no_info')
print(new_df)
   num1  num2  num3     cat1    cat2  target
0   1.0   NaN   7.0      pen     NaN     4.5
1   1.2   2.0   0.0     book     NaN     6.0
2   5.0   7.0   6.0   pencil     NaN     9.5
3   NaN   3.0   NaN     book  google     8.0
4   4.0   9.0   3.0  no_info     NaN     8.5
5   3.5   8.0   2.0     book     NaN     7.0
6   4.0   2.0   NaN      pen     NaN     NaNThis is often used for time-series data.
Back-fill means taking the value in the former row to fill-in the prior rows, while Forward-fill is the opposite.
# Back-fill new_df = df.fillna(method='bfill') print(new_df)
   num1  num2  num3    cat1    cat2  target
0   1.0   2.0   7.0     pen  google     4.5
1   1.2   2.0   0.0    book  google     6.0
2   5.0   7.0   6.0  pencil  google     9.5
3   4.0   3.0   3.0    book  google     8.0
4   4.0   9.0   3.0    book     NaN     8.5
5   3.5   8.0   2.0    book     NaN     7.0
6   4.0   2.0   NaN     pen     NaN     NaN# Forward-fill new_df = df.fillna(method='ffill') print(new_df)
   num1  num2  num3    cat1    cat2  target
0   1.0   NaN   7.0     pen     NaN     4.5
1   1.2   2.0   0.0    book     NaN     6.0
2   5.0   7.0   6.0  pencil     NaN     9.5
3   5.0   3.0   6.0    book  google     8.0
4   4.0   9.0   3.0    book  google     8.5
5   3.5   8.0   2.0    book  google     7.0
6   4.0   2.0   2.0     pen  google     7.0Be very careful when applying this method. When you remove anything from your data, you can potentially be missing some useful information that cannot be compensated.
However, cases by cases, sometimes you need to delete some rows or columns to not let them ruin the good value of your data. For example: when you have thousands of samples with perfect values (no NaN), but only 2 samples that have NaN on almost all columns, and you further find out that these NaN are produced by an error in your sensors, then you should delete these 2 rows. Another case you can consider is when your response variable contains several NaN, if removing these samples does not affect your data distribution, you should probably delete them.
# We delete column 'cat2' because it contains 6 NaNs over 7 samples. new_df = df.drop(['cat2'], axis=1) print(new_df)
   num1  num2  num3    cat1  target
0   1.0   NaN   7.0     pen     4.5
1   1.2   2.0   0.0    book     6.0
2   5.0   7.0   6.0  pencil     9.5
3   NaN   3.0   NaN    book     8.0
4   4.0   9.0   3.0     NaN     8.5
5   3.5   8.0   2.0    book     7.0
6   4.0   2.0   NaN     pen     NaN# We delete the last row because its response value is NaN new_df = df[df.target.notna()] print(new_df)
   num1  num2  num3    cat1    cat2  target
0   1.0   NaN   7.0     pen     NaN     4.5
1   1.2   2.0   0.0    book     NaN     6.0
2   5.0   7.0   6.0  pencil     NaN     9.5
3   NaN   3.0   NaN    book  google     8.0
4   4.0   9.0   3.0     NaN     NaN     8.5
5   3.5   8.0   2.0    book     NaN     7.0| Test your understanding | 
|  |