A dummy variable is a variable (or feature, predictor, column) whose values can be either 0 or 1.
Sometimes, when doing analysis and making inferences, we find ourselves adding dummy variables to our data. In this blog post, we list the most common cases in which creating dummy variables can benefit our data-mining process.
When doing one-hot encoding
One-hot encoding (or dummy encoding) is the process of transforming a categorical feature into a list of dummy features, with each of these dummies representing one unique value of the original categorical feature. The main aim of this method is to convert a categorical variable into a numerical one, since categorical variables are not accepted as input by many learning algorithms.
We also have a separate post on how to convert a categorical variable into numerical for your interest.
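As a minimal sketch with pandas (the `color` column and its values here are hypothetical, just for illustration):

```python
import pandas as pd

# A toy dataset with one categorical feature
df = pd.DataFrame({"color": ["red", "green", "blue", "green"]})

# get_dummies creates one 0/1 column per unique value of the feature
dummies = pd.get_dummies(df["color"], prefix="color", dtype=int)
print(dummies)
```

Each row now has exactly one 1 among the three `color_*` columns, marking which original value it had.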
When we want to record the NaN status
The struggle with NaN values (or Null, None) may be one of the most miserable experiences we have when cleaning data (second only to dealing with wrong or biased observations). The NaNs in our data can come from a wide range of origins: a recorder's mistake (she forgot to write down the value because she was busy browsing Youtube), a sensor malfunction, a parsing error made in a preceding phase (by ourselves or our teammates), or simply the value not being available at all (for example, the feature measures a shop's customer satisfaction, but no one came to the store that day due to a hurricane).
Most of the time, we handle the NaN values with some imputation method and then move on. However, if we sense something meaningful in those NaNs, we can keep track of them by making a dummy feature whose values indicate whether the feature-of-interest is NaN or not. This is a form of feature extraction that can potentially benefit the predictive models trained on our data.
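A short sketch of this idea, assuming a hypothetical `satisfaction` feature with missing values and median imputation as the cleanup step:

```python
import numpy as np
import pandas as pd

# Hypothetical feature containing missing values
df = pd.DataFrame({"satisfaction": [4.2, np.nan, 3.8, np.nan]})

# Record the NaN status in a dummy column BEFORE imputing,
# so the missingness signal survives the cleanup
df["satisfaction_was_nan"] = df["satisfaction"].isna().astype(int)

# Impute the missing values (here: with the median)
df["satisfaction"] = df["satisfaction"].fillna(df["satisfaction"].median())
```

The model can now learn from both the imputed value and the fact that it was originally missing.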
When we want to answer a Yes/No question
Here, we ask the data a binary question and store its answer in a dummy column.
There are many types of binary questions we can ask. They can be about a single feature (is the value of feature A higher than A's median? is the value of feature B outside the two-sigma range around B's mean?) or about multiple features (do features C and D have the same sign? is the value of feature E larger than that of F?). The main goal is to extract new information for ourselves to look at, gain knowledge from, and make inferences with.
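Two of the questions above can be sketched like this (the feature names `A`, `E`, `F` and their values are hypothetical):

```python
import pandas as pd

df = pd.DataFrame({"A": [1, 5, 9, 2],
                   "E": [3, 1, 10, 2],
                   "F": [2, 4, 7, 2]})

# Single-feature question: is A above its own median?
df["A_above_median"] = (df["A"] > df["A"].median()).astype(int)

# Multi-feature question: is E larger than F?
df["E_gt_F"] = (df["E"] > df["F"]).astype(int)
```

Each comparison yields a boolean Series, which `astype(int)` turns into the 0/1 dummy column.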
When we want to define or emphasize
Sometimes, a fact is subtle and cannot be deduced by the algorithms on their own, so we make a dummy variable to emphasize it. This type of question produces new information for the algorithms, and thus can enhance their performance. For example, in a dataset about student performance in schools with a feature holding each student's final grade point on a scale from 1 to 10, we can make a new dummy feature pointing out whether that grade point implies a Distinction or not.
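A sketch of the grade example, assuming (hypothetically) that a grade point of 8 or above counts as a Distinction:

```python
import pandas as pd

# Assumed cutoff for a Distinction; adjust to the actual grading rules
DISTINCTION_THRESHOLD = 8

df = pd.DataFrame({"final_grade": [9.5, 6.0, 8.0, 4.5]})

# Dummy feature emphasizing the Distinction status
df["distinction"] = (df["final_grade"] >= DISTINCTION_THRESHOLD).astype(int)
```

The threshold itself is domain knowledge we inject; the model would otherwise have to discover this cutoff from the raw grades.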
Test your understanding
We have just enumerated 4 situations in which adding a new dummy variable is reasonable and beneficial. Those are: