Naïve Bayes Classifier Mathematical Intuition


Naive Bayes is an algorithm based on Bayes’ Theorem. Despite the “naive” assumption of feature independence, these classifiers are widely utilized for their simplicity and efficiency in machine learning.


Assumptions of Naive Bayes (Geeks for Geeks, 2024)

The fundamental Naive Bayes assumption is that each feature makes an independent and equal contribution to the outcome. More specifically:

  • Feature independence: The features of the data are conditionally independent of each other, given the class label.
  • Continuous features are normally distributed: If a feature is continuous, then it is assumed to be normally distributed within each class.
  • Discrete features have multinomial distributions: If a feature is discrete, then it is assumed to have a multinomial distribution within each class.
  • Features are equally important: All features are assumed to contribute equally to the prediction of the class label.
  • No missing data: The data should not contain any missing values.

Bayes’ Theorem

Bayes’ Theorem finds the probability of an event occurring given the probability of another event that has already occurred. Bayes’ theorem is stated mathematically as the following equation:

P(A ∣ B) = [P(B ∣ A) · P(A)] / P(B)

where A and B are events and P(B) ≠ 0.
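
As a quick numerical illustration in Python (the probability values below are made up for the example, not taken from the article):

```python
# Hypothetical numbers: P(A) = prior, P(B | A) = likelihood, P(B) = evidence.
p_a = 0.01          # P(A): prior probability of the event
p_b_given_a = 0.9   # P(B | A): probability of the observation given the event
p_b = 0.05          # P(B): overall probability of the observation

# Bayes' theorem: P(A | B) = P(B | A) * P(A) / P(B)
p_a_given_b = p_b_given_a * p_a / p_b
print(p_a_given_b)  # 0.18
```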

Note: for numerical (continuous) data, the likelihood P(X ∣ Y = y) is computed from a probability density rather than a count; the Gaussian density typically used for this is given in the Gaussian Naïve Bayes section below.

Classification with Bayes’ Theorem: In the context of classification, Bayes’ theorem is used to calculate the probability of a class label given a set of features. Let’s denote:

  • 𝐶 as the class label.
  • 𝑋 as the set of features.

Then, the probability of class 𝐶 given features 𝑋 is given by:

P(C ∣ X) = [P(X ∣ C) · P(C)] / P(X)

Where:

  • 𝑃(𝐶∣𝑋) is the posterior probability of class 𝐶 given features 𝑋.
  • 𝑃(𝑋∣𝐶) is the likelihood of observing features 𝑋 given class 𝐶.
  • 𝑃(𝐶) is the prior probability of class 𝐶.
  • 𝑃(𝑋) is the probability of observing features 𝑋, also known as the evidence.

Naive Bayes makes a strong assumption that all features in 𝑋 are conditionally independent given the class label 𝐶. This means that the presence of a particular feature in a class is independent of the presence of any other feature.

Mathematically, this assumption can be expressed as:

P(X ∣ C) = P(x₁ ∣ C) · P(x₂ ∣ C) · … · P(xₙ ∣ C) = ∏ᵢ P(xᵢ ∣ C)

where xᵢ represents the i-th feature in X.

With the naive assumption, the classification rule simplifies to selecting the class label C that maximizes the posterior probability P(C ∣ X). Since P(X) is the same for every class, it can be dropped, giving:

Ĉ = argmax over C of P(C) · ∏ᵢ P(xᵢ ∣ C)

where Ĉ is the predicted class label.
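
As a concrete illustration of this decision rule, here is a minimal Python sketch; the class names, priors, and likelihood function are hypothetical placeholders, not part of the article:

```python
import math

def predict(x, classes, prior, likelihood):
    """Return the class that maximizes log P(C) + sum_i log P(x_i | C).

    prior:      dict mapping class -> P(C)
    likelihood: function (feature_index, feature_value, class) -> P(x_i | C)
    """
    best_class, best_score = None, -math.inf
    for c in classes:
        # Work in log space to avoid numerical underflow when many
        # small probabilities are multiplied together.
        score = math.log(prior[c])
        for i, x_i in enumerate(x):
            score += math.log(likelihood(i, x_i, c))
        if score > best_score:
            best_class, best_score = c, score
    return best_class
```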

Example. Solve the following question.

Naive Bayes Question

Construct a model using Naïve Bayes to decide whether a day is suitable for playing tennis. The table below shows whether tennis was played, based on Outlook, Temperature, and Wind.

Solution.
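
The original table and worked solution are shown as images, so here is only a minimal sketch of how such a model could be fit in Python with scikit-learn. The rows below are an illustrative stand-in with the same three categorical features, not the article’s data:

```python
from sklearn.naive_bayes import CategoricalNB
from sklearn.preprocessing import OrdinalEncoder

# Illustrative stand-in rows (Outlook, Temperature, Wind) -> PlayTennis.
X_raw = [
    ["Sunny",    "Hot",  "Weak"],
    ["Sunny",    "Hot",  "Strong"],
    ["Overcast", "Hot",  "Weak"],
    ["Rain",     "Mild", "Weak"],
    ["Rain",     "Cool", "Strong"],
    ["Overcast", "Cool", "Strong"],
]
y = ["No", "No", "Yes", "Yes", "No", "Yes"]

# Encode the categorical features as integers, as CategoricalNB expects.
encoder = OrdinalEncoder()
X = encoder.fit_transform(X_raw)

# alpha=1 applies the Laplace smoothing discussed later in the article.
model = CategoricalNB(alpha=1.0)
model.fit(X, y)

new_day = encoder.transform([["Sunny", "Cool", "Strong"]])
print(model.predict(new_day))        # predicted class label
print(model.predict_proba(new_day))  # posterior probability per class
```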

Types of Naive Bayes:

  • Gaussian Naive Bayes: Assumes that features follow a Gaussian distribution.
  • Multinomial Naive Bayes: Suitable for features with discrete counts, like word counts in text classification (see the sketch below).
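
As a quick illustration of the multinomial variant, here is a minimal text-classification sketch with scikit-learn; the example sentences and labels are made up for illustration:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Tiny, made-up corpus: per-document word counts are the discrete features.
texts = [
    "win money now",
    "cheap money offer",
    "meeting schedule today",
    "project meeting notes",
]
labels = ["spam", "spam", "ham", "ham"]

# CountVectorizer produces the word-count features MultinomialNB expects.
model = make_pipeline(CountVectorizer(), MultinomialNB(alpha=1.0))
model.fit(texts, labels)

print(model.predict(["free money offer"]))  # likely "spam" on this toy data
```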

Gaussian Naïve Bayes

In Gaussian Naive Bayes, continuous values associated with each feature are assumed to be distributed according to a Gaussian distribution (Geeks for Geeks, 2024). A Gaussian distribution is also called a normal distribution. When plotted, it gives a bell-shaped curve which is symmetric about the mean of the feature values, as shown below:

Gaussian Distribution

The likelihood of the features is assumed to be Gaussian, hence the conditional probability is given by:

P(xᵢ ∣ y) = (1 / √(2π σ²_y)) · exp(−(xᵢ − μ_y)² / (2σ²_y))

where μ_y and σ²_y are the mean and variance of feature xᵢ estimated from the training examples of class y.
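
A direct translation of this density into Python, as a small sketch (the numbers in the example call are hypothetical):

```python
import math

def gaussian_likelihood(x_i, mean, var):
    """P(x_i | y) under the Gaussian assumption, given the class's mean and variance."""
    coeff = 1.0 / math.sqrt(2.0 * math.pi * var)
    exponent = -((x_i - mean) ** 2) / (2.0 * var)
    return coeff * math.exp(exponent)

# Hypothetical values: feature value 5.0, class mean 4.2, class variance 1.5.
print(gaussian_likelihood(5.0, 4.2, 1.5))
```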

Laplace Smoothing

When one of the categorical features has a zero count for some class, or one of the numerical features has zero or extremely small variance, Naïve Bayes may perform poorly. To address this issue, we use a technique called Laplace smoothing.

In Naïve Bayes, if a particular feature value (in categorical features) or a particular range of values (in numerical features) does not appear in the training data for a given class, the conditional probability for that feature given the class will be zero. This can cause issues during classification because a zero probability will effectively make the entire posterior probability zero, regardless of the other evidence.

Laplace smoothing addresses these issues by adding a small constant value to the count of each feature value in categorical features or a small constant value to the variance in numerical features. This ensures that even if a feature value has not been observed in the training data, it still has a non-zero probability assigned to it.

The formula for Laplace smoothing in categorical features is:

P(xᵢ ∣ y) = (count(xᵢ, y) + α) / (count(y) + α · d)

Where:

  • P(xᵢ ∣ y) is the smoothed conditional probability of feature value xᵢ given class y.
  • count(xᵢ, y) is the number of training examples of class y with feature value xᵢ, and count(y) is the total number of training examples of class y.
  • d is the number of unique values of the feature.
  • α is the smoothing parameter (usually set to 1).
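
A minimal sketch of this formula in Python; the wind/play values below are made-up data chosen so that an unseen combination still gets a non-zero probability:

```python
from collections import Counter

def smoothed_prob(feature_values, labels, value, cls, alpha=1.0):
    """Laplace-smoothed P(value | cls) for one categorical feature."""
    d = len(set(feature_values))                        # unique values of this feature
    in_class = [v for v, y in zip(feature_values, labels) if y == cls]
    count_value = Counter(in_class)[value]               # count(x_i, y)
    return (count_value + alpha) / (len(in_class) + alpha * d)

# "Strong" wind never co-occurs with "Yes" in this toy data,
# yet the smoothed probability stays non-zero.
wind = ["Weak", "Weak", "Strong", "Weak", "Strong"]
play = ["Yes",  "Yes",  "No",     "Yes",  "No"]
print(smoothed_prob(wind, play, "Strong", "Yes"))  # (0 + 1) / (3 + 1*2) = 0.2
```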

Similarly, for numerical features with Gaussian (normal) distribution, Laplace smoothing can be applied by adding a small constant value to the variance:

σ̂²_y = σ²_y + α

Where:

  • σ̂²_y is the smoothed variance.
  • σ²_y is the original variance estimated from the training data.
  • α is the smoothing parameter.
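
In scikit-learn, GaussianNB exposes a related knob, var_smoothing, which adds a fraction of the largest feature variance (rather than a fixed constant) to all per-class variances for numerical stability. A minimal sketch with made-up data:

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB

# Made-up continuous features and labels, purely for illustration.
X = np.array([[1.0, 2.1], [1.2, 1.9], [3.8, 4.2], [4.1, 3.9]])
y = np.array([0, 0, 1, 1])

# var_smoothing plays the same stabilizing role as alpha above.
model = GaussianNB(var_smoothing=1e-9)
model.fit(X, y)
print(model.predict([[1.1, 2.0]]))
```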

By applying Laplace smoothing, Naïve Bayes classifiers become more robust and less sensitive to unseen feature values or extremely small variances, leading to improved performance, especially in cases where the training data is limited.

Advantages of Naive Bayes Classifier

  • Easy to implement and computationally efficient.
  • Effective in cases with a large number of features.
  • Performs well even with limited training data.
  • Performs well in the presence of categorical input features.
  • For numerical features, the normal-distribution assumption gives a simple closed-form likelihood.

Disadvantages of Naive Bayes Classifier

  • Assumes that features are independent, which may not always hold in real-world data.
  • Can be influenced by irrelevant attributes.
  • May assign zero probability to unseen events, leading to poor generalization.

Conclusion

In conclusion, Naive Bayes classifiers, despite their simplified assumptions, prove effective in various applications, showcasing notable performance in document classification and spam filtering. Their efficiency, speed, and ability to work with limited data make them valuable in real-world scenarios, compensating for their naive independence assumption.

Contributions and collaborations are welcome; feel free to reach out with any questions or collaboration ideas!

Contact: reallyhat@gmail.com
