Implementation of Naïve Bayesian Classifier Model to Classify a Set of Documents and to Measure the Accuracy, Precision, and Recall | by Kavyavarshini | May, 2024


In this post, we implement a Naïve Bayesian classifier for classifying a set of documents. We cover preprocessing the text data, training the classifier, and evaluating its performance with accuracy, precision, and recall. By the end, you should be able to implement and evaluate a Naïve Bayesian classifier for document classification yourself.

The Naïve Bayesian classifier is a probabilistic machine learning model used for classification tasks. It is based on Bayes’ Theorem and assumes that the features are conditionally independent given the class label. Despite this “naïve” assumption, it performs remarkably well in various applications, especially in text classification.
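To make the independence assumption concrete, here is a minimal from-scratch sketch of the multinomial variant on a toy corpus. The four documents, their labels, and the helper names train_nb and predict_nb are illustrative assumptions, not part of the 20 Newsgroups workflow used in this post. Add-one (Laplace) smoothing is assumed so that a word unseen in one class does not drive that class's probability to zero.

```python
import math
from collections import Counter, defaultdict


def train_nb(docs, labels):
    """Estimate log priors and per-class word log-likelihoods (add-one smoothing)."""
    vocab = {word for doc in docs for word in doc.split()}
    class_counts = Counter(labels)
    word_counts = defaultdict(Counter)
    for doc, label in zip(docs, labels):
        word_counts[label].update(doc.split())

    log_prior = {c: math.log(n / len(docs)) for c, n in class_counts.items()}
    log_lik = {}
    for c in class_counts:
        total = sum(word_counts[c].values())
        log_lik[c] = {
            w: math.log((word_counts[c][w] + 1) / (total + len(vocab)))
            for w in vocab
        }
    return log_prior, log_lik


def predict_nb(doc, log_prior, log_lik):
    """Pick the class maximising log P(class) + sum of log P(word | class)."""
    scores = {}
    for c in log_prior:
        # Independence assumption: per-word log-likelihoods are simply added.
        # Words outside the training vocabulary are ignored.
        scores[c] = log_prior[c] + sum(
            log_lik[c][w] for w in doc.split() if w in log_lik[c]
        )
    return max(scores, key=scores.get)


# Toy corpus (hypothetical data for illustration only)
toy_docs = ["goal match team", "team win goal", "vote election party", "party vote law"]
toy_labels = ["sport", "sport", "politics", "politics"]

log_prior, log_lik = train_nb(toy_docs, toy_labels)
print(predict_nb("team goal", log_prior, log_lik))
print(predict_nb("vote law", log_prior, log_lik))
```

Working in log space avoids numerical underflow when many small probabilities are multiplied, which matters once documents contain hundreds of words.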

For our implementation, we will use the 20 Newsgroups dataset, a collection of approximately 20,000 newsgroup documents partitioned across 20 different newsgroups (the copy that scikit-learn downloads with subset='all' contains 18,846 posts). This dataset is commonly used for experiments in text applications of machine learning, such as text classification and text clustering.

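First, let’s load the dataset and preprocess the text data. Text preprocessing typically involves tokenization, removal of stop words, optional stemming or lemmatization, and vectorization. Here, TfidfVectorizer handles lowercasing, tokenization, English stop-word removal, and TF-IDF weighting in a single step; no stemming or lemmatization is applied.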
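To see what the vectorizer does to a single piece of text, the sketch below runs its analyzer on one sentence and then vectorizes a two-document sample. The sentence and the sample documents are made-up illustrations; build_analyzer is the scikit-learn method that returns the tokenization and stop-word filtering step as a callable.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer(stop_words="english")

# The analyzer lowercases, tokenizes, and drops English stop words.
analyzer = vectorizer.build_analyzer()
tokens = analyzer("The striker scored in the final match!")
print(tokens)

# Vectorizing a tiny sample: one row per document, one column per kept term.
sample_docs = [
    "The striker scored in the final match!",
    "Parliament passed the election law.",
]
X_sample = vectorizer.fit_transform(sample_docs)
print(X_sample.shape)
print(sorted(vectorizer.vocabulary_))
```

Note that punctuation and very common words such as "the" and "in" never reach the feature matrix, which keeps its width down.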

from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score, precision_score, recall_score, classification_report

# Load the dataset
newsgroups = fetch_20newsgroups(subset='all')
data = newsgroups.data
target = newsgroups.target
print(f"Loaded {len(data)} documents.")

# Preprocess the data: convert raw text into TF-IDF feature vectors
vectorizer = TfidfVectorizer(stop_words='english')
X = vectorizer.fit_transform(data)
print(f"Transformed text data into a {X.shape[0]}x{X.shape[1]} matrix.")

Next, we split the dataset into training and testing sets. This allows us to train the model on one portion of the data and evaluate its performance on another.

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, target, test_size=0.25, random_state=42)
print(f"Training set has {X_train.shape[0]} samples.")
print(f"Testing set has {X_test.shape[0]} samples.")
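The split above is purely random. The 20 Newsgroups classes are roughly balanced, so this is usually fine, but when class sizes differ a stratified split keeps each class's share the same in both sets. The sketch below shows the stratify argument of train_test_split on hypothetical imbalanced labels (80 of one class, 20 of another); the variable names are illustrative.

```python
from collections import Counter

from sklearn.model_selection import train_test_split

# Hypothetical imbalanced labels: 80 samples of class 0, 20 of class 1
toy_labels = [0] * 80 + [1] * 20
toy_samples = list(range(len(toy_labels)))

s_train, s_test, l_train, l_test = train_test_split(
    toy_samples,
    toy_labels,
    test_size=0.25,
    random_state=42,
    stratify=toy_labels,  # preserve the 80/20 class ratio in both splits
)

print("train:", Counter(l_train))
print("test: ", Counter(l_test))
```

For the dataset in this post, the equivalent change would be passing stratify=target to the existing call.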

We will use the MultinomialNB classifier from scikit-learn. It is designed for discrete features such as word counts, although fractional features such as the TF-IDF weights we use here also work in practice.

# Train the Naïve Bayesian classifier
classifier = MultinomialNB()
classifier.fit(X_train, y_train)
print("Training complete.")

After training the model, we use it to make predictions on the test data.

# Make predictions on the test set
y_pred = classifier.predict(X_test)
print("Predictions complete.")
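One caveat about the workflow so far: the vectorizer was fitted on the full corpus before splitting, so the IDF weights and vocabulary were computed with knowledge of the test documents. A common way to avoid this is to split the raw text first and wrap the vectorizer and classifier in a scikit-learn Pipeline, so that only training documents are used for fitting. The sketch below illustrates the idea on a small made-up corpus; it is an alternative arrangement, not the code used in the rest of this post.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical toy corpus for illustration
toy_docs = [
    "striker scored goal match",
    "team won league match",
    "coach praised team defence",
    "goalkeeper saved penalty goal",
    "parliament passed election law",
    "minister announced budget vote",
    "party leader lost election",
    "senate debated budget law",
]
toy_labels = ["sport"] * 4 + ["politics"] * 4

# Split the RAW text first, so the test documents stay unseen.
docs_train, docs_test, labels_train, labels_test = train_test_split(
    toy_docs, toy_labels, test_size=0.25, random_state=42, stratify=toy_labels
)

# fit() learns the vocabulary and IDF weights from the training documents only;
# predict() applies the already-fitted vectorizer to the test documents.
model = make_pipeline(TfidfVectorizer(stop_words="english"), MultinomialNB())
model.fit(docs_train, labels_train)
toy_pred = model.predict(docs_test)
print(list(zip(docs_test, toy_pred)))
```

With this arrangement, words that occur only in test documents are simply ignored at prediction time, which mirrors how the model would behave on genuinely new text.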

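To evaluate the model, we calculate accuracy, precision, and recall on the test set. Because this is a 20-class problem, precision and recall are computed per class and then combined with average='weighted', which weights each class by its number of test documents.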

# Calculate accuracy
accuracy = accuracy_score(y_test, y_pred)
# Calculate precision and recall
precision = precision_score(y_test, y_pred, average='weighted')
recall = recall_score(y_test, y_pred, average='weighted')
print(f"Accuracy: {accuracy}")
print(f"Precision: {precision}")
print(f"Recall: {recall}")
# Print a detailed classification report
print("Classification Report:")
print(classification_report(y_test, y_pred, target_names=newsgroups.target_names))

The metrics are defined as follows:

  • Accuracy: The ratio of correctly predicted instances to the total number of instances.
  • Precision: For a given class, the ratio of documents correctly predicted as that class to all documents predicted as that class. It measures how trustworthy the positive predictions are.
  • Recall: For a given class, the ratio of documents correctly predicted as that class to all documents that actually belong to that class. It measures the model’s ability to capture all the positive samples.
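
These definitions can be checked by hand from a confusion matrix. The sketch below uses ten hypothetical labels across three classes, computes per-class precision and recall, and combines them with support weights in the same way average='weighted' does. It also shows a consequence of the definitions: support-weighted recall is always equal to accuracy, which is why the Accuracy and Recall lines printed by the code above show the same value.

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix, precision_score, recall_score

# Hypothetical true and predicted labels for three classes
toy_true = [0, 0, 0, 0, 1, 1, 1, 2, 2, 2]
toy_hat = [0, 0, 1, 0, 1, 1, 2, 2, 2, 0]

cm = confusion_matrix(toy_true, toy_hat)  # rows: true class, columns: predicted class
true_pos = np.diag(cm)

per_class_precision = true_pos / cm.sum(axis=0)  # divide by column sums (predicted)
per_class_recall = true_pos / cm.sum(axis=1)     # divide by row sums (actual)
support = cm.sum(axis=1)                         # number of true samples per class

weighted_precision = np.average(per_class_precision, weights=support)
weighted_recall = np.average(per_class_recall, weights=support)
toy_accuracy = true_pos.sum() / cm.sum()

print("per-class precision:", per_class_precision)
print("per-class recall:   ", per_class_recall)
print("weighted precision: ", weighted_precision)
print("weighted recall:    ", weighted_recall)
print("accuracy:           ", toy_accuracy)
```

If per-class differences matter more than overall volume, average='macro' gives every class equal weight instead.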

For reference, here is the complete script with all of the steps above collected in one place and each step commented.

# Import necessary libraries
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score, precision_score, recall_score, classification_report

# Step 1: Load the dataset
newsgroups = fetch_20newsgroups(subset='all')
data = newsgroups.data
target = newsgroups.target
print(f"Loaded {len(data)} documents.")

# Step 2: Preprocess the data
# Convert the raw text data into TF-IDF feature vectors
vectorizer = TfidfVectorizer(stop_words='english')
X = vectorizer.fit_transform(data)
print(f"Transformed text data into a {X.shape[0]}x{X.shape[1]} matrix.")

# Step 3: Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, target, test_size=0.25, random_state=42)
print(f"Training set has {X_train.shape[0]} samples.")
print(f"Testing set has {X_test.shape[0]} samples.")

# Step 4: Train the Naïve Bayesian classifier
classifier = MultinomialNB()
classifier.fit(X_train, y_train)
print("Training complete.")

# Step 5: Make predictions on the test set
y_pred = classifier.predict(X_test)
print("Predictions complete.")

# Step 6: Evaluate the model
# Calculate accuracy
accuracy = accuracy_score(y_test, y_pred)
# Calculate support-weighted precision and recall across the 20 classes
precision = precision_score(y_test, y_pred, average='weighted')
recall = recall_score(y_test, y_pred, average='weighted')
print(f"Accuracy: {accuracy}")
print(f"Precision: {precision}")
print(f"Recall: {recall}")

# Print a detailed per-class classification report
print("Classification Report:")
print(classification_report(y_test, y_pred, target_names=newsgroups.target_names))

In this blog post, we walked through the process of implementing a Naïve Bayesian classifier to classify a set of documents. We loaded and vectorized the dataset, trained the classifier, made predictions, and evaluated the model’s performance using accuracy, precision, and recall. The same approach works well for text classification tasks in general and can be applied to other domains with minimal modifications.

By following these steps, you should be able to implement your own Naïve Bayesian classifier for document classification and evaluate its performance. Despite its simplicity, the Naïve Bayesian model is fast to train, often serves as a strong baseline, and remains a valuable tool in the machine learning toolkit.
