Naive Bayes Explained

Multinomial Naïve Bayes

Let us see how Naïve Bayes works by walking through a movie review sentiment analysis task.

Let’s say we have a small movie review dataset which looks like this:

Movie Review  Sentiment
Review 1      Good
Review 2      Bad
Review 3      Good
Review 4      Good
Review 5      Bad

Now we want to use the Naïve Bayes algorithm to predict the sentiment of the movie review “Amazing Movie”.

1) Find Prior Probability

First we look at the small movie review dataset and count the number of good and bad reviews.

Sentiment  Frequency
Good       3
Bad        2

Probability of good reviews:

P(Good) = Number of good reviews / Total reviews = 3 / 5 = 0.6

Probability of bad reviews:

P(Bad) = Number of bad reviews / Total reviews = 2 / 5 = 0.4
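The same counts are easy to reproduce in code. Here is a minimal Python sketch (the variable names are just illustrative; only the sentiment labels are needed for the priors):

```python
from collections import Counter

# Sentiment labels of the five training reviews from the table above
labels = ["Good", "Bad", "Good", "Good", "Bad"]

counts = Counter(labels)              # Counter({'Good': 3, 'Bad': 2})
total = len(labels)                   # 5

p_good = counts["Good"] / total       # 3 / 5 = 0.6
p_bad = counts["Bad"] / total         # 2 / 5 = 0.4
print(p_good, p_bad)                  # 0.6 0.4
```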

2) Conditional Probability

Now we count the frequency of each word in the movie review dataset and put the counts into a table.

Good Reviews

Word      Frequency
Amazing   8
Romance   3
Horrible  1
Movie     3

Bad Reviews

Word      Frequency
Amazing   2
Romance   0
Horrible  6
Movie     2

We then compute the conditional probability of each word given the class. The good reviews contain 8 + 3 + 1 + 3 = 15 words in total and the bad reviews contain 2 + 0 + 6 + 2 = 10 words.

For example:

P(Amazing | Good) = Frequency of the word “Amazing” in good reviews / Total number of words in good reviews = 8 / 15

Word      Frequency  P(Word | Good)
Amazing   8          8 / 15 = 0.5333
Romance   3          3 / 15 = 0.2
Horrible  1          1 / 15 = 0.0667
Movie     3          3 / 15 = 0.2

We do the same for bad reviews as well

Word      Frequency  P(Word | Bad)
Amazing   2          2 / 10 = 0.2
Romance   0          0 / 10 = 0
Horrible  6          6 / 10 = 0.6
Movie     2          2 / 10 = 0.2
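These per-word probabilities follow directly from the frequency tables. A minimal Python sketch (the dictionaries simply transcribe the counts above):

```python
# Word frequencies per class, transcribed from the tables above
good_counts = {"Amazing": 8, "Romance": 3, "Horrible": 1, "Movie": 3}
bad_counts = {"Amazing": 2, "Romance": 0, "Horrible": 6, "Movie": 2}

def word_probs(counts):
    """P(word | class) = count of the word / total words in that class."""
    total = sum(counts.values())      # 15 for good reviews, 10 for bad reviews
    return {word: count / total for word, count in counts.items()}

p_word_given_good = word_probs(good_counts)   # e.g. 'Amazing': 0.5333
p_word_given_bad = word_probs(bad_counts)     # e.g. 'Romance': 0.0
print(p_word_given_good["Amazing"], p_word_given_bad["Amazing"])   # 0.5333... 0.2
```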

3) Now we can predict the sentiment of the review

“Amazing Movie”

P(Good | “Amazing Movie”) = P(Good) X P(Amazing | Good) X P(Movie | Good) = 0.6 X 0.5333 X 0.2 = 0.064

P(Bad | “Amazing Movie”) = P(Bad) X P(Amazing | Bad) X P(Movie | Bad) = 0.4 X 0.2 X 0.2 = 0.016

We can clearly see that P(Good | “Amazing Movie”) > P(Bad | “Amazing Movie”), so the Naïve Bayes algorithm predicts that the review “Amazing Movie” has positive sentiment. (Strictly speaking, these products are unnormalized scores: we have dropped the common denominator P(“Amazing Movie”) from Bayes’ theorem, which is fine because we only need to compare the two classes.)
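In code, scoring a review is just the prior multiplied by the probability of each of its words. A minimal sketch, reusing the dictionaries from the previous snippet:

```python
def nb_score(words, prior, word_probs):
    """Unnormalized Naive Bayes score: prior * product of P(word | class)."""
    score = prior
    for word in words:
        score *= word_probs[word]
    return score

review = ["Amazing", "Movie"]
score_good = nb_score(review, 0.6, p_word_given_good)   # 0.6 * 0.5333 * 0.2 = 0.064
score_bad = nb_score(review, 0.4, p_word_given_bad)     # 0.4 * 0.2 * 0.2 = 0.016
print("Good" if score_good > score_bad else "Bad")      # Good
```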

Laplace Smoothing

What about the review “Horrible Romance Movie”? Let’s analyze its sentiment using the Naïve Bayes algorithm.

P(Good | “Horrible Romance Movie”)

= P(Good) X P(Horrible | Good) X P(Romance | Good) X P(Movie | Good)

= 0.6 X 0.0667 X 0.2 X 0.2

= 0.0016

P(Bad | “Horrible Romance Movie”)

= P(Bad) X P(Horrible | Bad) X P(Romance | Bad) X P(Movie | Bad)

= 0.4 X 0.6 X 0 X 0.2

= 0

So P(Good | “Horrible Romance Movie”) > P(Bad | “Horrible Romance Movie”), which would make “Horrible Romance Movie” a positive review?!

Hmm… something is not quite right.

The problem is that the word “Romance” never appears in any bad review, so P(Romance | Bad) = 0, and that single zero forces the whole product to 0 regardless of the other words.

We can address this problem with a technique called Laplace smoothing.

We increase every word count by 1 to get rid of the zeros. Since each of the 4 words in the vocabulary gains one extra count, the totals in the denominators grow by 4 as well (15 + 4 = 19 for good reviews, 10 + 4 = 14 for bad reviews). We then recalculate the conditional probabilities.
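In code, Laplace smoothing is a one-line change to the probability calculation: add 1 to every count and add the vocabulary size to the denominator. A minimal sketch, reusing good_counts and bad_counts from the earlier snippet:

```python
def smoothed_word_probs(counts, alpha=1):
    """Add-one (Laplace) smoothing: (count + alpha) / (total + alpha * vocabulary size)."""
    vocab_size = len(counts)                              # 4 words in this toy vocabulary
    total = sum(counts.values()) + alpha * vocab_size     # 15 + 4 = 19, 10 + 4 = 14
    return {word: (count + alpha) / total for word, count in counts.items()}

p_good_smooth = smoothed_word_probs(good_counts)   # e.g. 'Amazing': 9/19 = 0.4737
p_bad_smooth = smoothed_word_probs(bad_counts)     # e.g. 'Romance': 1/14 = 0.0714
```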

Good Reviews

Word      Frequency   P(Word | Good)
Amazing   8 + 1 = 9   9 / 19 = 0.4737
Romance   3 + 1 = 4   4 / 19 = 0.2105
Horrible  1 + 1 = 2   2 / 19 = 0.1053
Movie     3 + 1 = 4   4 / 19 = 0.2105

Bad Reviews

Word      Frequency   P(Word | Bad)
Amazing   2 + 1 = 3   3 / 14 = 0.2143
Romance   0 + 1 = 1   1 / 14 = 0.0714
Horrible  6 + 1 = 7   7 / 14 = 0.5
Movie     2 + 1 = 3   3 / 14 = 0.2143

Let’s find the sentiment of “Horrible Romance Movie” again to see if we get a different result after Laplace Smoothing

P(Good | “Horrible Romance Movie”)

= P(Good) X P(Horrible | Good) X P(Romance | Good) X P(Movie | Good)

= 0.6 X 0.1053 X 0.2105 X 0.2105

= 0.0028

P(Bad | “Horrible Romance Movie”)

= P(Bad) X P(Horrible | Bad) X P(Romance | Bad) X P(Movie | Bad)

= 0.4 X 0.5 X 0.0714 X 0.2143

= 0.0031

Since P(Bad | “Horrible Romance Movie”) > P(Good | “Horrible Romance Movie”), the review “Horrible Romance Movie” is now classified as negative, which is correct.
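With the smoothed probabilities, the same scoring function now gives the correct answer. In practice, implementations usually sum log-probabilities instead of multiplying raw probabilities, because a product of many small numbers can underflow to zero on long documents. A minimal sketch, reusing p_good_smooth and p_bad_smooth from above:

```python
import math

def nb_log_score(words, prior, word_probs):
    """Sum of logs avoids numerical underflow and preserves the same ranking."""
    score = math.log(prior)
    for word in words:
        score += math.log(word_probs[word])
    return score

review = ["Horrible", "Romance", "Movie"]
log_good = nb_log_score(review, 0.6, p_good_smooth)   # log(0.0028) ≈ -5.88
log_bad = nb_log_score(review, 0.4, p_bad_smooth)     # log(0.0031) ≈ -5.78
print("Good" if log_good > log_bad else "Bad")        # Bad
```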

Gaussian Naïve Bayes

We use Gaussian Naïve Bayes to compute the conditional probability of continuous features (e.g. weight, height).

We assume that, within each class, these features follow a normal distribution (also known as a Gaussian distribution, hence the name).

Let’s use Gaussian Naïve Bayes to classify whether a person is male or female based on their height and weight. Suppose we are given a dataset of 6 people.

Height (cm)  Weight (kg)  Gender
171          65           Male
175          75           Male
180          83           Male
165          50           Female
171          55           Female
163          52           Female

Now we want to predict whether a person with height 173 cm and weight 80 kg is male or female.

1) Find prior probability

P(Male) = Number of males / Total number of people = 3 / 6 = 0.5

P(Female) = Number of females / Total number of people = 3 / 6 = 0.5

2) Find the mean and standard deviation of each feature within each class

Mean of height (male) = (171 + 175 + 180) / 3 = 175.33
Standard deviation of height (male) = 4.5093

Mean of height (female) = (165 + 171 + 163) / 3 = 166.33
Standard deviation of height (female) = 4.1633

Mean of weight (male) = (65 + 75 + 83) / 3 = 74.33
Standard deviation of weight (male) = 9.0185

Mean of weight (female) = (50 + 55 + 52) / 3 = 52.33
Standard deviation of weight (female) = 2.5166
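The numbers above are sample means and sample standard deviations (dividing by n − 1), which Python’s statistics module computes directly. A minimal sketch, assuming the data has already been split by gender (note that some implementations use the population variance instead, which shifts the figures slightly):

```python
import statistics

male_heights, male_weights = [171, 175, 180], [65, 75, 83]
female_heights, female_weights = [165, 171, 163], [50, 55, 52]

# statistics.stdev divides by n - 1 (sample standard deviation)
print(statistics.mean(male_heights), statistics.stdev(male_heights))       # 175.33, 4.5093
print(statistics.mean(female_heights), statistics.stdev(female_heights))   # 166.33, 4.1633
print(statistics.mean(male_weights), statistics.stdev(male_weights))       # 74.33, 9.0185
print(statistics.mean(female_weights), statistics.stdev(female_weights))   # 52.33, 2.5166
```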

3) Find the conditional probability of each feature using the Gaussian density

For a continuous feature we use the normal density in place of a word frequency:

P(x | class) = 1 / (σ√(2π)) · exp(−(x − μ)² / (2σ²))

where μ and σ are the mean and standard deviation of that feature within the class.

P(Male | height = 173, weight = 80)

= P(Male) X P(height = 173 | Male) X P(weight = 80 | Male)

= 0.5 X 0.0774 X 0.0363

≈ 0.0014

P(Female | height = 173, weight = 80)

= P(Female) X P(height = 173 | Female) X P(weight = 80 | Female)

= 0.5 X 0.0266 X 0.0000 (80 kg is roughly 11 standard deviations above the female mean weight, so this density is essentially 0)

≈ 0

Since P(Male | height = 173, weight = 80) > P(Female | height = 173, weight = 80), Gaussian Naïve Bayes predicts that this person is male.
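The same calculation in Python, plugging the class means and standard deviations into the normal density (a minimal sketch of the arithmetic above):

```python
import math

def gaussian(x, mean, std):
    """Normal (Gaussian) probability density of x given the class mean and std."""
    return math.exp(-((x - mean) ** 2) / (2 * std ** 2)) / (std * math.sqrt(2 * math.pi))

# (mean, standard deviation) of each feature per class, from step 2
male = {"height": (175.33, 4.5093), "weight": (74.33, 9.0185), "prior": 0.5}
female = {"height": (166.33, 4.1633), "weight": (52.33, 2.5166), "prior": 0.5}

def gnb_score(person, stats):
    return (stats["prior"]
            * gaussian(person["height"], *stats["height"])
            * gaussian(person["weight"], *stats["weight"]))

person = {"height": 173, "weight": 80}
print(gnb_score(person, male))     # ≈ 0.0014
print(gnb_score(person, female))   # ≈ 1e-29, effectively 0, so we predict Male
```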