Data Cleaning case study: Google Play Store Dataset


This post gives readers a practical example of how to clean a dataset. The data we wrangle with today is the Google Play Store Apps dataset, a simply formatted CSV table in which each row represents an application.

Dataset Name: Google Play Store Apps
Dataset Source: Kaggle
Task: Data cleaning
Language: Python

Column description

Overall, there are 13 columns:

  • App: Application name.
  • Category: Category the app belongs to.
  • Rating: Overall user rating of the app (as when scraped).
  • Reviews: Number of user reviews for the app (as when scraped).
  • Size: Size of the app (as when scraped).
  • Installs: Number of user downloads/installs for the app (as when scraped).
  • Type: Paid or Free.
  • Price: Price of the app (as when scraped).
  • Content Rating: Age group the app is targeted at – Children / Mature 21+ / Adult.
  • Genres: An app can belong to multiple genres (apart from its main category). For example, a musical family game will belong to the Music, Game, and Family genres.
  • Last Updated: Date when the app was last updated on Play Store (as when scraped).
  • Current Ver: Current version of the app available on Play Store (as when scraped).
  • Android Ver: Min required Android version (as when scraped).

(copied from the data source.)

Data Cleaning

Load and take an overview

# import library
from copy import deepcopy
import re
import numpy as np
import pandas as pd

# load data
df_playstore = pd.read_csv('../data/googleplaystore.csv')

In general, to have an overview of the data frame, I would print out the following information:

  • data shape: the number of instances and features.
  • several data rows: to have a sense of the values each data point may contain.
  • data types of the columns.
  • common statistics of the data frame (using the .describe() method).
  • missing-value status (using the .isna() method).
  • the number of unique values for each column.

print('df_playstore.shape:', df_playstore.shape)
df_playstore.shape: (10841, 13)
print('Some first rows of the data:')
display(df_playstore.head())
(Output: the first five rows of df_playstore.)
print('Column data-types:')
display(df_playstore.dtypes)
Column data-types:
App                object
Category           object
Rating            float64
Reviews            object
Size               object
Installs           object
Type               object
Price              object
Content Rating     object
Genres             object
Last Updated       object
Current Ver        object
Android Ver        object
dtype: object
print('General statistics of the dataframe:')
display(df_playstore.describe(include='all'))
(Output: summary statistics of the data frame.)
print('Missing-value status:')
display(df_playstore.isna().sum())
Missing-value status:
App                  0
Category             0
Rating            1474
Reviews              0
Size                 0
Installs             0
Type                 1
Price                0
Content Rating       1
Genres               0
Last Updated         0
Current Ver          8
Android Ver          3
dtype: int64
print('Number of unique values for each column:')
df_playstore.nunique()
Number of unique values for each column:
App               9660
Category            34
Rating              40
Reviews           6002
Size               462
Installs            22
Type                 3
Price               93
Content Rating       6
Genres             120
Last Updated      1378
Current Ver       2832
Android Ver         33
dtype: int64

Remove duplicates

The overview above shows that we have 10841 rows; however, only 9660 of the App names are unique. This raises a question: do the duplicated names refer to the same app or not?

If the Play Store required app names to be unique, i.e. different apps must have different names, these duplicates would be duplicated data points and should be handled so that only one of each remains.

However, it turns out that Google does allow apps with exactly the same name, except for names registered as trademarks, which are protected by law. Since we have no way of knowing whether any of the names in this dataset are trademarked, to maintain data integrity we assume that every name here is duplicatable (i.e. many apps may share the same name).

In other words, we should not delete a row just because its app name is identical to that of another row. Nevertheless, it is unrealistic for two different apps to coincide in every property, from the name, rating, and reviews to the size, etc. So we remove the duplicates that coincide with another row in all of the listed features.

df_playstore = df_playstore.drop_duplicates(keep='first')
print('df_playstore.shape:', df_playstore.shape)
df_playstore.shape: (10358, 13)
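
As a quick optional check (not part of the original flow), we can list the App names that still appear more than once; after the drop above, rows sharing a name should differ in at least one other column.

# sketch: inspect the App names that still appear more than once.
dup_names = df_playstore[df_playstore.duplicated(subset='App', keep=False)]
print('rows sharing a name with another row:', len(dup_names))
print('distinct shared names:', dup_names['App'].nunique())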

Column: Last Updated

This should obviously be a datetime column, yet pandas reads it as the default object type. This implies there is an issue with this column; let's fix it.

A first attempt to cast this column to datetime produces the following error:

pd.to_datetime(df_playstore['Last Updated'])
ParserError: day is out of range for month: 1.0.19

There seem to be some rows with ill-formatted values. The code below displays those rows.

tmp_cast = pd.to_datetime(df_playstore['Last Updated']
                          , errors='coerce')
df_playstore[tmp_cast.isna()]
(Output: the single row, index 10472, whose Last Updated value cannot be parsed.)

The problem is clear: in row 10472, the Category value is missing, so every later field is shifted one column to the left. To solve this, we shift all the values of this row back to the right:

row_missing_value = deepcopy(df_playstore.loc[10472])
row_missing_value[1:] = row_missing_value[1:].shift(periods=1)
df_playstore.loc[10472] = row_missing_value

# remove the temporary variable
del row_missing_value

On a side note, shifting the values of this row to the right also solves a problem you may have noticed at the beginning of this work: the maximum value of Rating appeared to be 19.0.

Now, how about the category of this app? A search on the Play Store suggests it belongs to the LIFESTYLE group. Hoping that the app's category has not changed since 11 Feb 2018, we fill it in:

df_playstore.loc[10472, 'Category'] = 'LIFESTYLE'

Now we can convert the Last Updated column to datetime without any errors.

df_playstore['Last Updated'] = (
    pd.to_datetime(df_playstore['Last Updated'])
)
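
As a small optional verification, we can confirm the new dtype and see the date range the dataset covers:

# sketch: verify the cast and inspect the covered time span.
print(df_playstore['Last Updated'].dtype)
print(df_playstore['Last Updated'].min(), 'to', df_playstore['Last Updated'].max())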

Column: Rating

The Rating values are already float-like strings; we only need a single call to convert them.

df_playstore.Rating = df_playstore.Rating.astype(float)

Column: Reviews

Similarly, the Reviews column is also ready to be cast to the Integer type.

df_playstore['Reviews'] = df_playstore['Reviews'].astype(int)

Column: Size

Amongst all the apps in this dataset, none reaches 1 GB in size. Every value we have to parse either ends with ‘M’ (megabytes), ends with ‘k’ (kilobytes), or equals ‘Varies with device’. The script below verifies that all of our 10358 rows fall into one of these three cases.

sum((df_playstore['Size'].str.match(r'\d+\.?\d*M').sum(),
     df_playstore['Size'].str.match(r'\d+\.?\d*k').sum(),
     df_playstore['Size'].str.match('Varies with device').sum()
    ))
10358

When converting these values into float, we must have them in the same unit, either MB or kB is fine. In this notebook, I choose to use the Megabyte unit, thus each app’s size that was originally in kB will be divided by 1024.

The last problem is how to handle the ‘Varies with device‘ value. Technically, it is not much different from NaN, since we have no information about the actual size of the application. What we do know, however, is that these apps have different versions available at once.

To conclude, we set the rows with ‘Varies with device‘ to NaN and add a new dummy column named Variable Size to better distinguish them.

# copy the desired column to a new variable
#   and work on this new variable
#   to avoid unnecessary errors for the main data frame.
df_playstore_size = deepcopy(df_playstore['Size'])

# get the indexes of rows for each type of size.
index_size_M = df_playstore_size.str.endswith('M')
index_size_k = df_playstore_size.str.endswith('k')
index_size_unknown = df_playstore_size.str.match('Varies with device')

# replace the values of the copy.
df_playstore_size[index_size_M] = (
    df_playstore_size[index_size_M].apply(
        lambda x: float(x[:-1])
    )
)
df_playstore_size[index_size_k] = (
    df_playstore_size[index_size_k].apply(
        lambda x : float(x[:-1]) / 1024
    )
)
df_playstore_size[index_size_unknown] = np.nan

# apply the replaced value to the main data frame.
df_playstore['Size'] = df_playstore_size.astype(float)

# make a new column to emphasize the apps with
#   variable size.
df_playstore['Variable Size'] = (
    df_playstore['Size'].isna().astype(int)
)

# delete temporary variables.
del df_playstore_size
del index_size_M, index_size_k, index_size_unknown
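
To double-check the conversion (an optional step), we can look at the summary statistics of the now-numeric Size column and count how many apps have a variable size:

# sketch: sanity-check the Size conversion.
print(df_playstore['Size'].describe())
print('apps with variable size:', df_playstore['Variable Size'].sum())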

Column: Installs

The number of installs is shown by buckets. Let’s see which buckets are there:

df_playstore['Installs'].value_counts()
1,000,000+        1488
10,000,000+       1132
100,000+          1129
10,000+           1033
1,000+             891
100+               710
5,000,000+         683
500,000+           517
50,000+            474
5,000+             469
10+                385
100,000,000+       369
500+               328
50,000,000+        272
50+                204
5+                  82
1+                  67
500,000,000+        61
1,000,000,000+      49
0+                  14
0                    1

The smallest values are ‘0’, ‘0+’, and ‘1+’. It is quite surprising that both ‘0’ and ‘0+’ exist, even though we would need only one of them. After a bit of searching on the internet, I could not find an official Play Store source about these bucket counts. However, as stated by most other sources, ‘x+’ has a lower bound of x+1. That is, ’50+’ means 51-100, ‘100+’ means 101-500, and so on. Thus, it makes sense to deduce that ‘0’ means 0, ‘0+’ means 1, ‘1+’ means 2-5, etc. (Please correct me if I’m wrong.)

Following the above rule, we convert this column to integer type:

# copy the desired column to a new variable
#   and work on this new variable
#   to avoid unnecessary errors for the main data frame.
df_playstore_installs = deepcopy(df_playstore.Installs)

# get the indexes of the rows that are 
#   not ready to be converted to int.
index_install_plus = df_playstore_installs.str.endswith('+')

# convert to int
df_playstore_installs[index_install_plus] = (
    df_playstore_installs[index_install_plus].apply(
        lambda x : int(x[:-1].replace(',', '')) + 1
    )
)

# apply the change to the main data frame
#   and rename the column to make it more clear.
df_playstore['Installs'] = df_playstore_installs.astype(int)
df_playstore = df_playstore.rename(
    columns={'Installs' : 'Installs at least'}
)

# delete the temporary variables.
del df_playstore_installs, index_install_plus
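
As an optional sanity check, the converted column should now be of integer type and its smallest values should follow the 0 / 1 / 2 pattern deduced above:

# sketch: verify the Installs conversion.
print(df_playstore['Installs at least'].dtype)
print(sorted(df_playstore['Installs at least'].unique())[:5])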

Column: Type

Apart from Free and Paid, this column also has 1 NaN, which is shown below:

df_playstore[df_playstore['Type'].isna()]
(Output: the single row, index 9148, whose Type is NaN.)

Notice that this is also the only row whose install count is 0 (the lone ‘0’ bucket seen earlier), which makes me suspect there is something wrong with this record.

Further investigation shows that this game seems to have appeared on the Play Store later than the time recorded in the dataset: Wikipedia says it launched on the Play Store on 2018-12-04, which is after the Last Updated value of 2018-06-28.

Because of this ambiguity, I decide to remove this row from the data.

df_playstore = df_playstore.drop(9148)

This column, Type, is then replaced by ‘Is Free’ – a dummy variable.

df_playstore['Is Free'] = (
    df_playstore.Type == 'Free'
).astype(int)

df_playstore = df_playstore.drop(['Type'], axis=1)
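
A quick optional check confirms that the problematic row is gone and shows how the new dummy is distributed:

# sketch: verify the drop and the new dummy column.
print('row 9148 still present:', 9148 in df_playstore.index)
print(df_playstore['Is Free'].value_counts())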

Column: Price

The price of an app seems to follow this rule: ‘0’ if it is free, else a dollar sign followed by a floating-point number. Let’s check if there are any exceptions:

print (sum(df_playstore.Price.str.match(r'^\$\d+\.?\d*$'))
       + sum(df_playstore.Price == '0')
      )
10357

There are no exceptions. Great, we can go straight to the conversion code:

# copy the desired column to a new variable
#   and work on this new variable
#   to avoid unnecessary errors for the main data frame.
df_playstore_price = deepcopy(df_playstore.Price)

# get the indexes of the rows that are 
#   not ready to be converted to float.
index_price_dollar = (
    df_playstore.Price.str.match(r'^\$\d+\.?\d*$')
)

# do the conversion.
df_playstore_price[index_price_dollar] = (
    df_playstore_price[index_price_dollar].str[1:].astype(float)
)

# apply the change to the main data frame
#   (cast to float so the remaining '0' strings become 0.0).
df_playstore.Price = df_playstore_price.astype(float)

# delete temporary variables
del df_playstore_price, index_price_dollar

Note that although this Price column covers the information of the Is Free column, I decide to keep both of them for now to emphasize the difference between free and paid apps. This use of dummy variables is elaborated in the post about when we should add a dummy variable.

Later, in case we are concerned with multicollinearity or similar issues, we can remove Is Free if necessary.
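
Since both columns are kept, a quick optional cross-check (assuming Price is now numeric) verifies that they tell the same story:

# sketch: rows where Price and Is Free disagree (expected to be empty).
mismatch = df_playstore[(df_playstore['Price'] == 0) != (df_playstore['Is Free'] == 1)]
print('disagreeing rows:', len(mismatch))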

Column: Content Rating

Let’s take a look at all the content rating tags:

df_playstore['Content Rating'].unique()
array(['Everyone', 'Teen', 'Everyone 10+', 'Mature 17+',
       'Adults only 18+', 'Unrated'], dtype=object)

These tags seem to follow the ESRB standard used in the United States.

We have two questions about this column: how should Unrated be understood, and how should we handle this ordinal variable?

  • Unrated apps will not be shown when doing content filtering: if you set up parental controls to restrict apps and games to a certain rating, you won’t see any Unrated apps in the Play Store. With that in mind, should we treat Unrated apps as NaN or as a number larger than 18?

To have a clearer view of the situation, let’s check the distribution of the values:

df_playstore['Content Rating'].value_counts()
Everyone           8383
Teen               1146
Mature 17+          447
Everyone 10+        376
Adults only 18+       3
Unrated               2
Name: Content Rating, dtype: int64

There are only 2 unrated apps in the dataset. As the number is so small, we have the option to remove them if we think doing so will not bias our data mining process.

For now, I decide to keep these rows to not affect the completeness of the dataset. ‘Unrated’ will be converted to NaN in the following step.

  • Content Rating is obviously an ordinal variable, and ordinal variables can be transformed into either categorical or interval variables. In this case, which one should we choose?

A detailed tutorial on how to treat ordinal variables is given in another post. In short, for practical purposes, converting ordinal variables to interval is recommended in most cases. We will stick with this for now.

# map the values of Content Rating to numbers
#   and store them in another column.
df_playstore['Minimum Age'] = df_playstore['Content Rating'].map(
    {'Everyone' : 0,
     'Everyone 10+' : 10,
     'Teen' : 13,
     'Mature 17+' : 17,
     'Adults only 18+' : 18,
     'Unrated' : np.nan
    }
)

# remove the old Content Rating,
# we already have Minimum Age as a replacement.
df_playstore = df_playstore.drop(['Content Rating'], axis=1)

Note that aside from converting tags to the corresponding ages, we also have the option to convert them to ranking, i.e. ‘Everyone’ is mapped to 0, ‘Everyone 10+’ to 1, ‘Teen’ to 2, etc.
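
A minimal sketch of that alternative (not applied here, and it would have to run before Content Rating is dropped; ‘Content Rating Rank’ is just a hypothetical column name) could look like this:

# sketch: ordinal rank encoding instead of minimum age (not applied here).
rank_map = {'Everyone': 0, 'Everyone 10+': 1, 'Teen': 2,
            'Mature 17+': 3, 'Adults only 18+': 4, 'Unrated': np.nan}
# df_playstore['Content Rating Rank'] = df_playstore['Content Rating'].map(rank_map)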

Column: Android Ver

Look at the value counts:

df_playstore['Android Ver'].value_counts()
4.1 and up            2379
4.0.3 and up          1451
4.0 and up            1338
Varies with device    1220
4.4 and up             894
2.3 and up             643
5.0 and up             546
4.2 and up             387
2.3.3 and up           279
2.2 and up             239
3.0 and up             237
4.3 and up             235
2.1 and up             133
1.6 and up             116
6.0 and up              58
7.0 and up              42
3.2 and up              36
2.0 and up              32
5.1 and up              22
1.5 and up              20
4.4W and up             11
3.1 and up              10
2.0.1 and up             7
8.0 and up               6
7.1 and up               3
5.0 - 8.0                2
4.0.3 - 7.1.1            2
1.0 and up               2
4.1 - 7.1.1              1
5.0 - 7.1.1              1
5.0 - 6.0                1
7.0 - 7.1.1              1
2.2 - 7.1.1              1

and the missing values:

df_playstore[df_playstore['Android Ver'].isna()]
(Output: the two rows whose Android Ver is NaN.)

We make the following observations:

  • There are 2 NaN values.
  • More than a thousand apps are not specific about their version, i.e. ‘Varies with device‘.
  • Several apps list a bounded range such as ‘5.0 - 8.0’, i.e. they stopped updating for newer Android versions.

We will:

  • Add a column ‘Android Major Ver From‘ that stores the minimum major version of Android each app supports. ‘Varies with device‘ will be converted to NaN in this column.
  • Add a dummy column ‘Variable Android Ver‘ that equals 1 iff the app’s Android Ver is ‘Varies with device‘.
  • Add a dummy column ‘Is Still Maintained’ that equals 1 iff the app still supports the latest Android version.

# function to extract the major android version from text.
# the major version starts from the beginning and
#   ends before the first dot (.).
# NaN is returned if cannot extract.
def get_android_major_version(text):
    try:
        ver = int(text[:text.find('.')])
        return ver
    except (ValueError, AttributeError):
        # raised for non-numeric text (e.g. 'Varies with device') or missing values.
        return np.nan

df_playstore['Android Major Ver From'] = (
    df_playstore['Android Ver'].apply(get_android_major_version)
)
df_playstore['Variable Android Ver'] = (
    (df_playstore['Android Ver'].str
     .match('^Varies with device$') == True
    ).astype(int)
)
# the apps whose Android ver end with 'and up'
#   or equals to 'Varies with device'
#   are considered still maintained
df_playstore['Is Still Maintained'] = (
    df_playstore['Android Ver'].str.endswith('and up') | 
    df_playstore['Android Ver'].str.match(r'Varies with device')
)
# remove the no-longer-wanted column.
df_playstore = df_playstore.drop(['Android Ver'], axis=1)

Note that by removing ‘Android Ver’ we lose some information about the minor Android version. Whether we also extract the minor version depends on our purposes in mining this dataset (in most cases, the answer seems to be no).
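
If the minor version were ever needed, a rough sketch (it would have to run before Android Ver is dropped, and ‘Android Minor Ver From’ is a hypothetical column name) might look like this:

# sketch: extract the minor Android version from strings such as '4.0.3 and up'.
minor_ver_re = re.compile(r'^(\d+)\.(\d+)')

def get_android_minor_version(text):
    match = minor_ver_re.match(str(text))
    # 'Varies with device' and missing values do not match and yield NaN.
    return int(match.group(2)) if match else np.nan

# df_playstore['Android Minor Ver From'] = (
#     df_playstore['Android Ver'].apply(get_android_minor_version)
# )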

Column: Current Ver

Some pieces of information that can be extracted from this column are:

  • If this is the first version of the app.
  • If the app has multiple versions for different devices.
  • The current major version. Note that this might be a bit noisy due to the preferences of the developers themselves: given the same change, some app producers might declare a new major version while others consider it only a minor one.

It is also worth mentioning that many values in this field do not follow any standard version format. Some of those values are listed below; we will treat them as NaN.

(A sample of such nonstandard Current Ver values: ‘Copyright’, ‘Craigslist’, ‘Rocksteady 1.3’, ‘DH-Security Camera’, ‘MyMetro_HTML_2.0_430029’, ‘opción de cerrar’, among others.)

# prepare a function that can recognize if
#   a Current Ver value indicates an updated version
#   or the initial version.
# if the input value is 'Varies with device'
#   that means the app has been updated.
# if the input does not follow the standard
#   return NaN
# if the input follows the standard then
#   extract the version and check if it is
#   1 or 1.0 or 1.0.0 etc.
app_ver_std_re = re.compile(r'(?i)(?:v|version)?:? ?(\d.*)')
first_version_re = re.compile(r'1(\.0)*$')

def recognized_has_updated(current_ver):
    if current_ver == 'Varies with device':
        return 1
    match = re.match(app_ver_std_re, current_ver)
    if match:
        version = match.group(1)
        if re.match(first_version_re, version):
            return 0
        else:
            return 1
    else:
        return np.nan

# apply 
df_playstore['Has Ever Updated'] = (
    df_playstore['Current Ver'].apply(
        lambda x : recognized_has_updated(str(x))
    )
)
# make a column to indicate if
#   the app has different versions
#   for different devices.
df_playstore['Variable App Ver'] = (
    df_playstore['Current Ver'].str
    .match(r'^Varies with device$')
    .astype(float)
)
# write a function to extract the major version 
# the regex used this time is very similar
#   to that of the recognized_has_updated,
#   the only difference is on the capturing group.
extract_major_ver_re = re.compile(r'(?i)(?:v|version)?:? ?(\d+).*')

def extract_app_major_version(current_ver):
    match = re.match(extract_major_ver_re, current_ver)
    if match:
        # cast to int so that the resulting column is numeric.
        return int(match.group(1))
    return np.nan

# apply
df_playstore['Major App Ver'] = (
    df_playstore['Current Ver'].apply(
        lambda x : extract_app_major_version(str(x))
    )
)
# Current Ver has fulfilled its role.
df_playstore = df_playstore.drop(['Current Ver'], axis=1)
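
As an optional sanity check, we can look at how the three new columns are distributed:

# sketch: inspect the columns derived from Current Ver.
print(df_playstore['Has Ever Updated'].value_counts(dropna=False))
print(df_playstore['Variable App Ver'].value_counts(dropna=False))
print(df_playstore['Major App Ver'].value_counts(dropna=False).head())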

Column: Category and Genres

We will convert these two columns to numerical form with one-hot encoding. Luckily, there are no NaNs here. However, note that while an app may belong to only one category, multiple genres can be associated with it.

Handling the Category is simple:

# One-hot encode Category
df_playstore = (
    pd.get_dummies(
        df_playstore, columns=['Category'], prefix='Cat'
    )
)

The Genres column involves a bit more work:

# cast Genres to string type
df_playstore.Genres = df_playstore.Genres.astype(str)

# split the genres of each app
df_stacked_genres = (
    df_playstore.Genres.str
    .split(';', expand=True).stack()
)

# apply one-hot encoding
df_genres_dummies = (
    pd.get_dummies(df_stacked_genres, prefix='gen')
    .groupby(level=0).sum()
)

# join the one-hot result into the main data frame
df_playstore = (df_playstore.merge(df_genres_dummies, 
                                   left_index=True, 
                                   right_index=True)
               ).drop(['Genres'], axis=1)

# delete temporary variables
del df_stacked_genres, df_genres_dummies
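
To make sure the multi-label encoding behaved as expected (an optional check), we can confirm that some apps received more than one genre flag:

# sketch: columns produced by the genre encoding start with the 'gen_' prefix.
gen_cols = [c for c in df_playstore.columns if c.startswith('gen_')]
print('apps with more than one genre:',
      (df_playstore[gen_cols].sum(axis=1) > 1).sum())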

Everything is done. Here are the first 5 rows of the resulting data frame after being cleaned.

(Output: the first five rows of the cleaned data frame.)

What we have done:

  • Loaded the data and shown some basic statistics of the dataset.
  • Checked and fixed a row with a missing field in this CSV-saved dataset (row 10472).
  • Re-formatted the data types of the columns from string to int, float, or datetime accordingly.
  • Removed an erroneous row.
  • Made several dummy columns to emphasize some traits (Variable Size, Variable Android Ver, Variable App Ver).
  • One-hot encoded Category and Genres (Genres may contain multiple values per row).

The Jupyter Notebook containing the full code is given here.
