DATA SCIENCE THEORY | MACHINE LEARNING | KNIME ANALYTICS PLATFORM
A theoretical explanation and an implementation using KNIME’s visual workflows
About the author:
Alberto Montanari has a degree in Mechanical Engineering and spent his career with the Fiat Chrysler Automobiles (FCA) Group, serving as president of several of its companies around the world. After retiring from FCA, he turned his interest to data mining and analytics with Python and KNIME. As a KNIME Certified Trainer, he spreads the culture of low code in Bologna, Northern Italy, and organizes KNIME courses for management (supported by Federmanager).
Naive Bayes is a fundamental algorithm in machine learning and artificial intelligence, widely used for classification tasks.
To understand the Naive Bayes classifier (which divides data into classes/groups), we start with Bayes’ theorem, a fundamental concept of probability named after the 18th-century English mathematician Thomas Bayes. This theorem helps us calculate the probability of an event occurring, based on the occurrence of a related event.
The term “naive” refers to the algorithm’s key assumption: all features used in the prediction process are assumed to be conditionally independent. In practice, naive Bayes’ feature independence assumption is often violated. Yet, this assumption helps simplify computations and often holds surprisingly well in practical applications.
The Naive Bayes classifier is a supervised machine learning algorithm designed to assign a label or category to an object based on its features. Here’s a summarized overview:
· Training phase: The algorithm analyzes a dataset with labeled examples (e.g., emails classified as “spam” or “not spam”) and calculates the probabilities of features for each target class.
· Prediction phase: For new data (e.g., a new email), the algorithm uses the calculated probabilities to determine the most likely target class.
Using the Naive Bayes algorithm for classification offers numerous advantages with a few limitations.
Advantages
- Simplicity: It is easy to implement and requires relatively short training time compared to more complex algorithms.
- Computational Efficiency: Requires fewer computational resources compared to more complex algorithms.
- Strong Performance: Despite assuming independence between features, it often delivers accurate results.
- Ease of Interpretation: Because predictions are built from interpretable conditional probabilities, it is easy to see how much each feature contributes to a given class prediction.
- Versatility: It can be applied across various fields and to different types of problems.
Limitations
- Independence Assumption: In reality, features may be correlated. For instance, in text, the presence of “snow” and “cold” is likely related.
- Categorical Data: It works better with categorical data than with continuous numerical data, though modified versions exist to handle this limitation.
The Naive Bayes classifier is used in many real-world scenarios:
- Spam Filtering: Many email services use this algorithm to identify and filter unwanted emails. By analyzing keywords and common patterns in spam, the system can automatically classify incoming messages.
- Sentiment Analysis: In online reviews or social media, the classifier can determine if a comment is positive, negative, or neutral, helping businesses gauge public opinion about their products or services.
- Medical Diagnosis: In healthcare, it can help predict the likelihood of a patient having a specific disease based on symptoms and test results.
- Recommendation Systems: Platforms like Netflix or Amazon can use this algorithm to suggest movies, TV shows, or products that might interest users based on their previous behavior.
To better illustrate how the Naive Bayes classifier works, we’ll start with a simple practical example, but first, let’s recap the key theoretical concepts:
- Prior Probability: The initial probability of an event occurring before observing any evidence. For example, the probability that a patient has the flu without considering their symptoms.
- Conditional (Posterior) Probability: The probability of an event occurring given that another event has already occurred. For instance, the probability that a patient has a fever given that they have the flu.
- Bayes’ Theorem statement: Simply put, the probability of A given B equals the probability of B given A, multiplied by the probability of A, divided by the probability of B. In symbols: P(A|B) = [P(B|A) × P(A)] / P(B).
Explanation of Terms
- P(A|B): This is the probability that event A is true given that B is true. In other words, if we know B has occurred (e.g., the email contains the word “offer”), we want to determine how likely it is that A is true (the email is spam).
- P(B|A): This is the probability that B is true if A is true. So, if we know the email is spam (A), this probability tells us how likely it is to contain the word “offer” (B).
- P(A): This represents the prior probability of A, meaning how likely it is that an email is spam before considering specific features (B).
- P(B): This is the prior probability of B, which refers to how likely it is to observe feature B in general (“offer”), regardless of A.
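To make these terms concrete, here is a minimal Python sketch that applies Bayes’ theorem to the spam example; the numeric values are invented purely for illustration and are not taken from any real dataset.

```python
def bayes(p_b_given_a: float, p_a: float, p_b: float) -> float:
    """Bayes' theorem: P(A|B) = P(B|A) * P(A) / P(B)."""
    return p_b_given_a * p_a / p_b

# Hypothetical spam-filter figures (illustration only):
p_spam = 0.30              # P(A): prior probability that an email is spam
p_offer_given_spam = 0.60  # P(B|A): probability that a spam email contains "offer"
p_offer = 0.25             # P(B): probability that any email contains "offer"

p_spam_given_offer = bayes(p_offer_given_spam, p_spam, p_offer)
print(f"P(spam | 'offer') = {p_spam_given_offer:.2f}")  # 0.72
```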
With these concepts in mind, let’s work through a practical example to see the Naive Bayes classifier in action, followed by an example using KNIME, the free and open-source software for data science.
Imagine predicting the winner of Formula 1 races based on circuit speed. The contingency table below shows fictitious results for Ferrari and McLaren on slow and fast circuits. Using probability calculations, Naive Bayes can predict outcomes for different scenarios.
Let’s assume Ferrari performs better on slower circuits, while McLaren excels on faster ones. In what follows, A denotes a Ferrari win (A’ a McLaren win) and B denotes a fast circuit (B’ a slow circuit).
Next, let’s calculate the frequency distributions by dividing each value by the total number of observations (20):
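As a quick check, here is a minimal Python sketch. The counts are an assumed breakdown that is consistent with the probabilities listed below (20 observations in total); each frequency is obtained by dividing a count by that total.

```python
# Assumed win counts by circuit type (consistent with the probabilities below)
counts = {
    ("Ferrari", "fast"): 4,   # A and B
    ("McLaren", "fast"): 9,   # A' and B
    ("Ferrari", "slow"): 5,   # A and B'
    ("McLaren", "slow"): 2,   # A' and B'
}

total = sum(counts.values())  # 20 observations

# Frequency distribution: each count divided by the total
frequencies = {cell: n / total for cell, n in counts.items()}
for cell, freq in frequencies.items():
    print(cell, freq)  # e.g. ('Ferrari', 'fast') 0.2
```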
Then, we calculate the probabilities:
Marginal Probabilities:
- P(A) = 0.45
- P(A’) = 0.55
- P(B) = 0.65
- P(B’) = 0.35
Joint Probabilities (A and B together):
- P(A ∩ B) = 0.20
- P(A’ ∩ B) = 0.45
- P(A ∩ B’) = 0.25
- P(A’ ∩ B’) = 0.10
Conditional Probabilities:
- P(A|B) = P(A ∩ B) / P(B) = 0.20 / 0.65 ≈ 0.31
- P(A’|B) = P(A’ ∩ B) / P(B) = 0.45 / 0.65 ≈ 0.69
- P(A|B’) = P(A ∩ B’) / P(B’) = 0.25 / 0.35 ≈ 0.71
- P(A’|B’) = P(A’ ∩ B’) / P(B’) = 0.10 / 0.35 ≈ 0.29
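The same arithmetic can be reproduced in a few lines of Python, starting from the joint probabilities above:

```python
# Joint probabilities taken from the frequency table above
p = {
    ("A", "B"): 0.20,    # Ferrari wins on a fast circuit
    ("A'", "B"): 0.45,   # McLaren wins on a fast circuit
    ("A", "B'"): 0.25,   # Ferrari wins on a slow circuit
    ("A'", "B'"): 0.10,  # McLaren wins on a slow circuit
}

# Marginal probabilities: sum the joint probabilities over the other variable
p_A = p[("A", "B")] + p[("A", "B'")]    # 0.45 (Ferrari win)
p_B = p[("A", "B")] + p[("A'", "B")]    # 0.65 (fast circuit)
p_B_prime = 1 - p_B                     # 0.35 (slow circuit)

# Conditional probabilities: joint probability divided by the marginal of the condition
p_A_given_B = p[("A", "B")] / p_B               # 0.20 / 0.65 ≈ 0.31
p_A_given_B_prime = p[("A", "B'")] / p_B_prime  # 0.25 / 0.35 ≈ 0.71

print(round(p_A_given_B, 2), round(p_A_given_B_prime, 2))  # 0.31 0.71
```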
Applying Bayes’ Theorem:
Let’s calculate P(B|A), which is the probability that a fast circuit was used given Ferrari’s win:
- P(B|A) = [P(B) * P(A|B)] / [P(B) * P(A|B) + P(B’) * P(A|B’)]
→ P(B|A) = (0.65 * 0.31) / [(0.65 * 0.31) + (0.35 * 0.71)] = 0.45
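The same calculation in Python, plugging in the rounded conditional probabilities from above:

```python
# Probabilities from the worked example above
p_B = 0.65                # P(B): fast circuit
p_B_prime = 0.35          # P(B'): slow circuit
p_A_given_B = 0.31        # P(A|B): Ferrari wins given a fast circuit
p_A_given_B_prime = 0.71  # P(A|B'): Ferrari wins given a slow circuit

# Bayes' theorem, with the law of total probability in the denominator
numerator = p_B * p_A_given_B
denominator = p_B * p_A_given_B + p_B_prime * p_A_given_B_prime
p_B_given_A = numerator / denominator

print(round(p_B_given_A, 2))  # ≈ 0.45
```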
The calculations of other probabilities are left as an exercise. In this case, the example is simple and could have been solved directly from the table. However, with more rows and columns, the process becomes significantly more complex.
KNIME simplifies machine learning tasks like implementing the Naive Bayes classifier thanks to its visual workflows.
In this example, we’ll build a Naive Bayes classifier to distinguish red and white wines based on their chemical properties.
After partitioning the dataset with stratified sampling into a training set (80%) and a test set (20%), we apply the Learner-Predictor paradigm to obtain predictions, using the Naive Bayes Learner and Naive Bayes Predictor nodes.
For comparison purposes, we can also train a logistic regression algorithm, using the same Learner-Predictor paradigm.
Finally, we assess the performance of the models using the Scorer node.
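For readers who prefer code to visual workflows, here is a rough Python/scikit-learn sketch of an equivalent pipeline. It is only a sketch under assumptions that are not part of the KNIME workflow: the file names, the use of the UCI Wine Quality CSVs as the red/white wine dataset, and GaussianNB (the Gaussian variant of Naive Bayes for continuous features) are all choices made purely for illustration.

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Hypothetical file names: the UCI Wine Quality CSVs, labeled by color
red = pd.read_csv("winequality-red.csv", sep=";").assign(color="red")
white = pd.read_csv("winequality-white.csv", sep=";").assign(color="white")
wine = pd.concat([red, white], ignore_index=True)

X = wine.drop(columns=["color", "quality"])  # chemical properties only
y = wine["color"]

# 80/20 split with stratified sampling, mirroring the Partitioning node
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# Learner-Predictor paradigm: fit on the training set, predict on the test set,
# then score each model, mirroring the Scorer node
for name, model in [("Naive Bayes", GaussianNB()),
                    ("Logistic Regression", LogisticRegression(max_iter=1000))]:
    model.fit(X_train, y_train)
    accuracy = accuracy_score(y_test, model.predict(X_test))
    print(f"{name}: accuracy = {accuracy:.3f}")
```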
In this example, we can observe that the Naive Bayes and Logistic Regression classifiers achieve comparably high performance.
That’s it! Easy, huh?
Thanks for reading and happy KNIMEing!
Download the workflow from the KNIME Community Hub and try it out yourself!