Naive Bayes classifier: a comprehensive guide


Naive Bayes is a very popular algorithm in Machine Learning thanks to its simplicity, solid performance, and the elegant idea it borrows from Bayesian probability.

In this blog post, we present the Bayes formula, show how to build a Naive Bayes classifier (with pen and paper, or with Python and sklearn), and discuss the assumptions behind this method together with its strengths and weaknesses.

The outline is as follows:

Table of contents

Naive Bayes method
Example
Laplace smoothing
Naive Bayes with Python
Assumptions
Strengths
Weaknesses

Naive Bayes method

The Naive Bayes classifier relies on Bayes’ theorem, which describes the probability of an event given prior knowledge related to it. The theorem is simply:

P(A|B) = \frac{P(B|A)P(A)}{P(B)}

where:

  • A and B are 2 events.
  • P(A|B) is the conditional probability of A given B is true.
  • P(B|A) is the conditional probability of B given A is true.
  • P(A) and P(B) are the probabilities of A and B, respectively.

This formula is true because
P(A|B)P(B) = P(B|A)P(A) = P(A, B)
(where P(A, B) denotes the joint probability of both events A and B – the probability that both A and B are true).
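As a quick sanity check, the identity can be verified numerically on a toy joint distribution (a sketch; the probabilities below are made-up numbers, not from any dataset):

```python
# Toy joint distribution over two binary events A and B (made-up numbers)
p_joint = {(True, True): 0.2, (True, False): 0.3,
           (False, True): 0.1, (False, False): 0.4}

p_A = p_joint[(True, True)] + p_joint[(True, False)]   # P(A) = 0.5
p_B = p_joint[(True, True)] + p_joint[(False, True)]   # P(B) = 0.3

p_A_given_B = p_joint[(True, True)] / p_B              # P(A|B)
p_B_given_A = p_joint[(True, True)] / p_A              # P(B|A)

# Bayes' theorem: P(A|B) = P(B|A) * P(A) / P(B)
assert abs(p_A_given_B - p_B_given_A * p_A / p_B) < 1e-12
```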

In the context of Machine Learning and Classification, we can rewrite the formula as below:

P(c_i|X) = \frac{P(X|c_i)P(c_i)}{P(X)}

where:

  • c_i is the i-th class, 1 \leq i \leq k, where k is the number of classes.
  • X is a single data point (one sample, not the entire training dataset).
  • P(c_i|X) is the probability that the data point X belongs to class i.
  • P(X|c_i) is the probability of observing the data point X given class i.
  • P(X) is the probability of X.

To find the most suitable class for each data point X, we find all the P(c_i|X) for 1 \leq i \leq k, and then we assign X to the class with the highest probability.

In other words, we find \arg\max_i P(c_i|X), or equivalently, \arg\max_i \frac{P(X|c_i)P(c_i)}{P(X)}.

Notice that the value of P(X) is constant across all classes. So, we can safely ignore it and adjust our goal to finding arg max _i of P(X|c_i)P(c_i).

\blacktriangleright P(c_i) is easily computed by taking the proportion of class i in our whole dataset.

P(c_i) = \frac{|X \in c_i|}{n}, where n is the sample size and |X \in c_i| is the number of data samples that belong to class i.

\blacktriangleright P(X|c_i) requires a bit more work. Recall that X is a vector of values: X_j (1 \leq j \leq m, where m is the number of predictor variables) is the value of predictor variable j for the data point X.

We make an assumption here that the predictor variables are all independent of each other.
Thus, P(X|c_i) = P(X_1|c_i)*P(X_2|c_i)* ... *P(X_m|c_i).

If the j-th predictor variable is categorical, P(X_j|c_i) can be computed directly from the training dataset. If it is numerical, we may either distribute the values into bins (each bin then treated as a categorical value), or assume a distribution (e.g. the Gaussian distribution) and take the PDF.
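As an illustration of the Gaussian option, P(X_j|c_i) can be estimated by fitting a normal distribution to the values of feature j within class i and evaluating its PDF. Below is a minimal sketch; the weights are hypothetical numbers, not from the example dataset:

```python
import math

def gaussian_pdf(x, mean, var):
    """PDF of a normal distribution with the given mean and variance."""
    return math.exp(-(x - mean) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

# Hypothetical weights (kg) of the cats in some training set
cat_weights = [3.8, 4.2, 4.0, 4.4]
mean = sum(cat_weights) / len(cat_weights)
var = sum((w - mean) ** 2 for w in cat_weights) / len(cat_weights)

# Estimate of P(weight = 4.1 | Cat) under the Gaussian assumption
p = gaussian_pdf(4.1, mean, var)
```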

Example

Suppose our task is to classify the type of animal given its fur color and the sharpness of its claws. The training data looks like below:

Naive Bayes example training data

No. | Fur color | Sharpness of claws | Animal
1   | Black     | High               | Cat
2   | Yellow    | Low                | Cat
3   | Black     | High               | Cat
4   | Blue      | High               | Cat
5   | Yellow    | Low                | Dog
6   | Blue      | Low                | Dog
7   | Yellow    | High               | Dog
8   | Black     | Low                | Dog

and here is the testing data:

Naive Bayes example testing data

No. | Fur color | Sharpness of claws | Animal
9   | Yellow    | Low                | ?
10  | Black     | High               | ?

We now pre-compute some values with the information provided by the training data. These values are then used to predict the samples in the testing set.

First, we need the probability of each class.

P(Cat) = \frac{\text{number of cats}}{\text{total sample size}} = \frac{4}{8} = 0.5

P(Dog) = \frac{\text{number of dogs}}{\text{total sample size}} = \frac{4}{8} = 0.5

Then, we compute the conditional probability of each category given a class.

P(Fur = Black | Cat) = \frac{\text{number of cats with black fur}}{\text{number of cats}} = \frac{2}{4} = 0.5

P(Fur = Black | Dog) = \frac{\text{number of dogs with black fur}}{\text{number of dogs}} = \frac{1}{4} = 0.25

P(Fur = Yellow | Cat) = \frac{\text{number of cats with yellow fur}}{\text{number of cats}} = \frac{1}{4} = 0.25

P(Fur = Yellow | Dog) = \frac{\text{number of dogs with yellow fur}}{\text{number of dogs}} = \frac{2}{4} = 0.5

and so on, until we have all the conditional probabilities P(category | class), as in the table below.

Naive Bayes example P(category | class)

Class | Fur: Black | Fur: Yellow | Fur: Blue | Claws: High | Claws: Low
Cat   | 0.5        | 0.25        | 0.25      | 0.75        | 0.25
Dog   | 0.25       | 0.5        | 0.25      | 0.25        | 0.75

We have done the training part. Now, it’s time for predictions. We will predict the classes of 2 testing examples shown above.

For the first testing sample (the sample with index No. 9):

\begin{aligned}
P(Cat | X^{(9)}) &= \frac{P(X^{(9)}|Cat)\,P(Cat)}{P(X^{(9)})} \\
&= \frac{P(Yellow|Cat)\,P(Low|Cat)\,P(Cat)}{P(X^{(9)})} \\
&= \frac{0.25 \times 0.25 \times 0.5}{P(X^{(9)})} = \frac{0.03125}{P(X^{(9)})}
\end{aligned}

\begin{aligned}
P(Dog | X^{(9)}) &= \frac{P(X^{(9)}|Dog)\,P(Dog)}{P(X^{(9)})} \\
&= \frac{P(Yellow|Dog)\,P(Low|Dog)\,P(Dog)}{P(X^{(9)})} \\
&= \frac{0.5 \times 0.75 \times 0.5}{P(X^{(9)})} = \frac{0.1875}{P(X^{(9)})}
\end{aligned}

Since P(X^{(9)}) is positive (it is the probability of a data point that was actually observed), we can compare the numerators directly and conclude that P(Dog | X^{(9)}) > P(Cat | X^{(9)}).

Thus, according to the Naive Bayes algorithm, sample No. 9 is classified as Dog.

For the second testing sample (the sample with index No. 10):

\begin{aligned}
P(Cat | X^{(10)}) &= \frac{P(X^{(10)}|Cat)\,P(Cat)}{P(X^{(10)})} \\
&= \frac{P(Black|Cat)\,P(High|Cat)\,P(Cat)}{P(X^{(10)})} \\
&= \frac{0.5 \times 0.75 \times 0.5}{P(X^{(10)})} = \frac{0.1875}{P(X^{(10)})}
\end{aligned}

\begin{aligned}
P(Dog | X^{(10)}) &= \frac{P(X^{(10)}|Dog)\,P(Dog)}{P(X^{(10)})} \\
&= \frac{P(Black|Dog)\,P(High|Dog)\,P(Dog)}{P(X^{(10)})} \\
&= \frac{0.25 \times 0.25 \times 0.5}{P(X^{(10)})} = \frac{0.03125}{P(X^{(10)})}
\end{aligned}

Hence, we conclude the sample No.10 is classified as Cat.
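The hand computations above can be reproduced in a few lines of plain Python. This is a sketch: the priors and conditional probabilities are typed in directly from the tables above, and we only compare the numerators since P(X) cancels out:

```python
# Priors and conditional probabilities from the training tables above
prior = {"Cat": 0.5, "Dog": 0.5}
cond = {
    "Cat": {"Black": 0.50, "Yellow": 0.25, "Blue": 0.25, "High": 0.75, "Low": 0.25},
    "Dog": {"Black": 0.25, "Yellow": 0.50, "Blue": 0.25, "High": 0.25, "Low": 0.75},
}

def posterior_numerator(features, cls):
    """P(X|class) * P(class) -- the quantity we maximize (P(X) cancels out)."""
    p = prior[cls]
    for value in features:
        p *= cond[cls][value]
    return p

def classify(features):
    return max(prior, key=lambda cls: posterior_numerator(features, cls))

print(classify(["Yellow", "Low"]))   # sample No. 9  -> Dog
print(classify(["Black", "High"]))   # sample No. 10 -> Cat
```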

Laplace Smoothing

In practice, with categorical variables, it sometimes happens that a category in the testing set does not appear in the training set (e.g., in the example above, what if the fur color of sample No. 9 were Pink instead of Yellow?). In that case, every computed product is 0, which gives no clue about which class to assign the data point to.

To resolve this issue, we often apply Laplace smoothing to the computation of the probabilities.

In other words, instead of the conventional formula:

P(X_{i} = v_j | c_k) = \frac{n_{ijk}}{n_k}

we use the Laplace estimate:

P(X_{i} = v_j | c_k) = \frac{n_{ijk} + 1}{n_k + s_i}

where:

  • X_{i} is the value of the sample at the i-th feature.
  • v_j is the j-th categorical value of feature i.
  • c_k is the k-th class.
  • n_{ijk} is the number of samples in c_k that have X_{i} = v_j.
  • n_k is the number of samples of class k.
  • s_i is the number of possible values for X_{i} .

For example, applying this to the problem above (here s_i = 3, since Fur has three possible colors):

P(Fur = Black | Cat) = \frac{\text{number of cats with black fur} + 1}{\text{number of cats} + 3} = \frac{2 + 1}{4 + 3} = \frac{3}{7}
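The smoothed estimate is a one-liner; here is a sketch with the arguments named after the definitions above:

```python
def laplace_estimate(n_ijk, n_k, s_i):
    """Laplace-smoothed estimate of P(X_i = v_j | c_k)."""
    return (n_ijk + 1) / (n_k + s_i)

# P(Fur = Black | Cat): 2 black-furred cats, 4 cats, 3 fur colors
print(laplace_estimate(2, 4, 3))   # 3/7 ~ 0.4286

# If Pink were added as a fourth possible fur color (s_i = 4),
# the unseen category no longer gets probability 0:
print(laplace_estimate(0, 4, 4))   # 1/8 = 0.125
```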

Naive Bayes with Python

To run the Naive Bayes algorithm in Python, we can use the pre-built classes from the sklearn library, which live in sklearn.naive_bayes:
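For instance, the toy animal problem above can be solved with CategoricalNB. This is a sketch: the integer encoding of the categories is our own choice, and alpha=1.0 corresponds to the Laplace smoothing described earlier:

```python
from sklearn.naive_bayes import CategoricalNB

# Encode the toy training set from the example above:
# Fur: Black=0, Yellow=1, Blue=2; Claws: High=0, Low=1
X_train = [[0, 0], [1, 1], [0, 0], [2, 0],   # cats
           [1, 1], [2, 1], [1, 0], [0, 1]]   # dogs
y_train = ["Cat", "Cat", "Cat", "Cat", "Dog", "Dog", "Dog", "Dog"]

# alpha=1.0 applies the Laplace smoothing discussed above
model = CategoricalNB(alpha=1.0)
model.fit(X_train, y_train)

X_test = [[1, 1],   # sample No. 9:  Yellow fur, Low sharpness
          [0, 0]]   # sample No. 10: Black fur, High sharpness
print(model.predict(X_test))  # ['Dog' 'Cat']
```

Other variants in sklearn.naive_bayes (GaussianNB, MultinomialNB, BernoulliNB) implement the distribution assumptions mentioned above for numerical, count, and binary features, respectively.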

Assumptions

  • Features (predictor variables) are independent of each other.
  • The chosen treatment of continuous features (binning, or an assumed distribution such as the Gaussian) fits the data.

Strengths

  • Surprisingly simple: training is essentially just counting values.
  • Naive Bayes is a generative model, so it is comfortable with even a small amount of data.
  • It is fast and requires little memory.
  • Moderate performance in general, and better when the assumptions hold.
  • Innate ability to handle missing values (simply ignore the missing feature).
  • Outputs probabilistic predictions (it does not just classify a data point but also produces numerical values representing how likely the point belongs to each class).
  • Naive Bayes is not sensitive to noisy data.

Weaknesses

  • The assumption of feature independence rarely holds true.
  • It is quite hard to handle continuous features properly.
  • Naive Bayes treats each feature independently, so it does not learn interactions between features. For example, Tony does not want to go outside when he is in a bad mood or when it is raining, but he likes to go out if both conditions are true.
  • Its performance is usually not the best compared to other algorithms.
  • Basic batch implementations must be re-trained when new data arrives, although the count-based nature of the model makes incremental updates straightforward (sklearn exposes this via partial_fit).
  • Does not work well if some categorical values in the testing set never appear in the training set (mitigated by Laplace smoothing).

