# Naive Bayes classifier: a comprehensive guide

Naive Bayes is a very popular algorithm in Machine Learning, given its simplicity, solid performance, and the interesting ideas it borrows from Bayesian probability.

In this blog post, we explain the Bayes formula, show how to build a Naive Bayes classifier (with pen and paper, or with Python and sklearn), and discuss the assumptions behind this method together with its strengths and weaknesses.


## Naive Bayes method

The Naive Bayes classifier algorithm relies on Bayes’ theorem about the probability of an event given prior knowledge related to it. The theorem is simply:

P(A | B) = P(B | A) P(A) / P(B)

where:

• A and B are 2 events.
• P(A | B) is the conditional probability of A given B is true.
• P(B | A) is the conditional probability of B given A is true.
• P(A) and P(B) are the probabilities of A and B, respectively.

This formula is true because P(A | B) P(B) = P(B | A) P(A) = P(A, B) (where P(A, B) denotes the joint probability of both events A and B – the probability that both A and B are true).
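As a quick sanity check, the identity can be verified numerically. The counts below are made-up numbers for illustration, not data from this post:

```python
# Verify Bayes' theorem on a made-up joint distribution of two
# binary events A and B, specified by hypothetical counts.
n_total = 100
n_A = 40      # samples where A is true
n_B = 50      # samples where B is true
n_AB = 30     # samples where both A and B are true

p_A = n_A / n_total
p_B = n_B / n_total
p_AB = n_AB / n_total        # joint probability P(A, B)
p_A_given_B = n_AB / n_B     # P(A | B)
p_B_given_A = n_AB / n_A     # P(B | A)

# Bayes' theorem: P(A | B) = P(B | A) * P(A) / P(B)
assert abs(p_A_given_B - p_B_given_A * p_A / p_B) < 1e-12
# Both conditionals times their condition give the joint probability:
assert abs(p_A_given_B * p_B - p_AB) < 1e-12
print(p_A_given_B)  # 0.6
```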

In the context of Machine Learning and Classification, we can rewrite the formula as below:

P(C_i | X) = P(X | C_i) P(C_i) / P(X)

where:

• C_i is the i-th class, with i = 1, 2, ..., k, where k is the number of classes.
• X is a data point from the training data. Note that X represents 1 sample data point (not the entire dataset).
• P(C_i | X) is the probability that the sample data point X belongs to class i.
• P(X | C_i) is the probability of the data point X to appear in class i.
• P(X) is the probability of X.

To find the most suitable class for each data point X, we compute all the P(C_i | X) for i = 1, 2, ..., k, and then we assign X to the class with the highest probability.

In other words, we find argmax_i P(C_i | X), or equivalently, argmax_i P(X | C_i) P(C_i) / P(X).

Notice that the value of P(X) is constant across all classes. So, we can safely ignore it and adjust our goal to finding argmax_i P(X | C_i) P(C_i).

P(C_i) is easily computed by taking the proportion of class i in our whole dataset: P(C_i) = n_i / n, where n is the sample size and n_i is the number of data samples that belong to class i.

P(X | C_i) requires a bit more work. Be reminded that X is a vector of values, X = (x_1, x_2, ..., x_m), where m is the number of predictor variables and x_j is the value of the predictor variable j for this data point X.

We make an assumption here that the predictor variables are all independent of each other.
Thus, P(X | C_i) = P(x_1 | C_i) P(x_2 | C_i) ... P(x_m | C_i).

If the j-th predictor variable is a categorical variable, P(x_j | C_i) can be simply calculated from the training dataset. In case it is a numerical variable, we may either distribute the values into bins and then treat each bin as a categorical value, or assume a distribution (e.g. the Gaussian distribution) and take its PDF.
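The Gaussian option amounts to fitting a per-class mean and variance and plugging the feature value into the normal PDF (this is the approach scikit-learn's GaussianNB takes). A minimal sketch, where the cat weights are made-up numbers for illustration:

```python
import math

def gaussian_pdf(x, mean, var):
    """PDF of a normal distribution with the given mean and variance."""
    return math.exp(-(x - mean) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

# Hypothetical weights (kg) of the cats in a training set
cat_weights = [3.8, 4.2, 4.0, 4.4]
mean = sum(cat_weights) / len(cat_weights)
var = sum((w - mean) ** 2 for w in cat_weights) / len(cat_weights)

# P(weight = 4.1 | Cat) is approximated by the fitted Gaussian's density
print(gaussian_pdf(4.1, mean, var))
```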

## Example

Suppose our task is to classify the type of animal given its fur color and the sharpness of its claws. The training data looks like below:

## Naive Bayes example training data

| No. | Fur color | Sharpness of claws | Animal |
|-----|-----------|--------------------|--------|
| 1   | Black     | High               | Cat    |
| 2   | Yellow    | Low                | Cat    |
| 3   | Black     | High               | Cat    |
| 4   | Blue      | High               | Cat    |
| 5   | Yellow    | Low                | Dog    |
| 6   | Blue      | Low                | Dog    |
| 7   | Yellow    | High               | Dog    |
| 8   | Black     | Low                | Dog    |

and here is the testing data:

## Naive Bayes example testing data

| No. | Fur color | Sharpness of claws | Animal |
|-----|-----------|--------------------|--------|
| 9   | Yellow    | Low                | ?      |
| 10  | Black     | High               | ?      |

We now pre-compute some values with the information provided by the training data. These values are then used to predict the samples in the testing set.

First, we need the probability of each class.

P(Cat) = 4/8 = 0.5

P(Dog) = 4/8 = 0.5

Then, we compute the conditional probability of each category given a class.

P(Fur = Black | Cat) = 2/4 = 0.5
P(Fur = Black | Dog) = 1/4 = 0.25
P(Fur = Yellow | Cat) = 1/4 = 0.25
P(Fur = Yellow | Dog) = 2/4 = 0.5
etc.

Until we have all the conditional probabilities P(category | class) as in the below table.

## Naive Bayes example P(category | class)

| Class | Fur: Black | Fur: Yellow | Fur: Blue | Sharpness: High | Sharpness: Low |
|-------|------------|-------------|-----------|-----------------|----------------|
| Cat   | 0.5        | 0.25        | 0.25      | 0.75            | 0.25           |
| Dog   | 0.25       | 0.5         | 0.25      | 0.25            | 0.75           |
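These counts are purely mechanical, so the whole "training" step can be reproduced in a few lines of Python (the data is the training table above):

```python
from collections import Counter

# Training data from the table above: (fur, sharpness, animal)
train = [("Black", "High", "Cat"), ("Yellow", "Low", "Cat"),
         ("Black", "High", "Cat"), ("Blue", "High", "Cat"),
         ("Yellow", "Low", "Dog"), ("Blue", "Low", "Dog"),
         ("Yellow", "High", "Dog"), ("Black", "Low", "Dog")]

class_counts = Counter(animal for _, _, animal in train)
priors = {c: n / len(train) for c, n in class_counts.items()}
print(priors)  # {'Cat': 0.5, 'Dog': 0.5}

def cond_prob(feature_index, value, animal):
    """P(feature = value | animal), estimated by counting."""
    matches = sum(1 for row in train
                  if row[feature_index] == value and row[2] == animal)
    return matches / class_counts[animal]

print(cond_prob(0, "Black", "Cat"))  # 0.5
print(cond_prob(1, "High", "Dog"))   # 0.25
```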

We have done the training part. Now, it’s time for predictions. We will predict the classes of 2 testing examples shown above.

For the first testing sample (the sample with index No. 9: Fur = Yellow, Sharpness = Low):

P(Cat) P(Fur = Yellow | Cat) P(Sharpness = Low | Cat) = 0.5 × 0.25 × 0.25 = 0.03125
P(Dog) P(Fur = Yellow | Dog) P(Sharpness = Low | Dog) = 0.5 × 0.5 × 0.75 = 0.1875

As we know P(X) is a positive number (X is an observed data point, and the probability of an already-happened event is always larger than 0), dividing both quantities by P(X) preserves the comparison, so we have P(Dog | X) > P(Cat | X).

Thus, according to the Naive Bayes algorithm, sample No. 9 is classified as Dog.

For the second testing sample (the sample with index No. 10: Fur = Black, Sharpness = High):

P(Cat) P(Fur = Black | Cat) P(Sharpness = High | Cat) = 0.5 × 0.5 × 0.75 = 0.1875
P(Dog) P(Fur = Black | Dog) P(Sharpness = High | Dog) = 0.5 × 0.25 × 0.25 = 0.03125

Hence, we conclude that sample No. 10 is classified as Cat.
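The prediction step reduces to comparing products of the pre-computed probabilities; a sketch, with the values from the P(category | class) table hard-coded:

```python
# P(category | class) values from the table above
cond = {
    ("Cat", "Fur", "Yellow"): 0.25, ("Dog", "Fur", "Yellow"): 0.5,
    ("Cat", "Fur", "Black"): 0.5,   ("Dog", "Fur", "Black"): 0.25,
    ("Cat", "Sharp", "Low"): 0.25,  ("Dog", "Sharp", "Low"): 0.75,
    ("Cat", "Sharp", "High"): 0.75, ("Dog", "Sharp", "High"): 0.25,
}
priors = {"Cat": 0.5, "Dog": 0.5}

def score(animal, fur, sharp):
    # Unnormalized posterior: P(class) * P(fur | class) * P(sharp | class)
    return priors[animal] * cond[(animal, "Fur", fur)] * cond[(animal, "Sharp", sharp)]

# Sample No. 9: Yellow fur, Low sharpness
print(score("Cat", "Yellow", "Low"))   # 0.03125
print(score("Dog", "Yellow", "Low"))   # 0.1875  -> classified as Dog

# Sample No. 10: Black fur, High sharpness
print(score("Cat", "Black", "High"))   # 0.1875  -> classified as Cat
print(score("Dog", "Black", "High"))   # 0.03125
```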

## Laplace Smoothing

In practice, for categorical variables, it sometimes happens that a category in the testing set does not appear in the training set (e.g. in the above example, what if, in sample No. 9, the fur color were Pink instead of Yellow?). In these cases, the scores we compute are all 0, which gives no clue about which class we should assign the data point to.

To resolve this issue, we often apply Laplace smoothing to the computation of the probabilities.

In other words, instead of the conventional formula:

P(x_i = a_ij | C_k) = n_ijk / n_k

we use the Laplace estimate:

P(x_i = a_ij | C_k) = (n_ijk + 1) / (n_k + v_i)

where:

• x_i is the value of the sample at the i-th feature.
• a_ij is the j-th categorical value of feature i.
• C_k is the k-th class.
• n_ijk is the number of samples in C_k that have x_i = a_ij.
• n_k is the number of samples of class k.
• v_i is the number of possible values for feature i.
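The smoothed estimate is simple arithmetic; a minimal sketch, using the fur-color feature of the example above (taking v_i = 3, the colors seen in training):

```python
def laplace_estimate(n_match, n_class, n_values):
    """(count of value within class + 1) / (class size + number of possible values)."""
    return (n_match + 1) / (n_class + n_values)

# P(Fur = Black | Cat): 2 black-furred cats out of 4 cats, 3 fur colors
print(laplace_estimate(2, 4, 3))  # 3/7 ≈ 0.43
# An unseen category no longer gets probability 0:
print(laplace_estimate(0, 4, 3))  # 1/7 ≈ 0.14
```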

For example, if we apply it to the problem above (Fur has 3 possible values):

P(Fur = Black | Cat) = (2 + 1) / (4 + 3) = 3/7 ≈ 0.43

## Naive Bayes with Python

To run the Naive Bayes algorithm in Python, we can use the pre-built classes from the sklearn library, which can be found in sklearn.naive_bayes:
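For the categorical toy problem above, sklearn.naive_bayes.CategoricalNB fits directly. A sketch; the integer encoding of the categories is our own choice, and alpha=1.0 is the Laplace smoothing discussed earlier:

```python
# Naive Bayes on the toy animal dataset, encoded as integers:
# Fur: Black=0, Yellow=1, Blue=2; Sharpness: High=0, Low=1.
from sklearn.naive_bayes import CategoricalNB

X_train = [[0, 0], [1, 1], [0, 0], [2, 0],   # Cats (samples 1-4)
           [1, 1], [2, 1], [1, 0], [0, 1]]   # Dogs (samples 5-8)
y_train = ["Cat", "Cat", "Cat", "Cat", "Dog", "Dog", "Dog", "Dog"]

# alpha=1.0 corresponds to the Laplace smoothing described above
model = CategoricalNB(alpha=1.0)
model.fit(X_train, y_train)

X_test = [[1, 1],   # No. 9: Yellow fur, Low sharpness
          [0, 0]]   # No. 10: Black fur, High sharpness
print(model.predict(X_test))        # expected: Dog for No. 9, Cat for No. 10
print(model.predict_proba(X_test))  # probabilistic predictions per class
```

GaussianNB, BernoulliNB and MultinomialNB live in the same module and cover numerical, binary, and count features, respectively.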

## Assumptions

• Features (predictor variables) are independent of each other.
• The solution for handling continuous data (binning or assuming a distribution) is suitable.

## Strengths

• Surprisingly simple: it is mostly just counting values.
• Naive Bayes is a generative model. It is comfortable with just a small amount of data.
• It is fast and requires little memory.
• Moderate performance in general, even better when the assumptions hold.
• Innate ability to handle missing values (just ignore that feature).
• Outputs probabilistic predictions (i.e. it does not just classify the data but also produces numerical values representing how likely a data point belongs to each class).
• Naive Bayes is not sensitive to noisy data.

## Weaknesses

• The assumption of feature independence rarely holds true.
• It is quite hard to handle continuous features properly.
• Naive Bayes treats each feature independently, thus it does not learn interactions between features. For example, Tony does not want to go outside when he is in a bad mood or when it is raining, but he likes to go out if both conditions are true.
• The performance is not the best (compared to other algorithms) in most cases.
• It does not support online learning: when you have new data, you have to re-train the Naive Bayes model.
• It does not work (well) if some values in the testing set do not appear in the training set.
