Implementation of Naïve Bayesian Classifier Model to Classify a Set of Documents and to Measure the Accuracy, Precision, and Recall | by Kavyavarshini | May, 2024

In this blog post, we will implement a Naïve Bayesian classifier for classifying a set of documents. We will preprocess the text data, train the classifier, and evaluate its performance using accuracy, precision, and recall. By the end, you should understand how to implement and evaluate a Naïve Bayesian classifier for document classification.

The Naïve Bayesian classifier is a probabilistic machine learning model used for classification tasks. It is based on Bayes’ Theorem and assumes that the features are conditionally independent given the class label. Despite this “naïve” assumption, it performs remarkably well in various applications, especially in text classification.
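To make the idea concrete, here is a minimal from-scratch sketch of a multinomial Naïve Bayes classifier. It is an illustration only, not the sklearn implementation used in the rest of this post: the toy corpus, the function names, and the choice of add-one (Laplace) smoothing are all assumptions made for this example. Each class is scored by its log prior plus the sum of the log likelihoods of the words in the document, which is exactly where the conditional independence assumption is used.

```python
import math
from collections import Counter

# Hypothetical toy training corpus: (tokens, label) pairs
train_docs = [
    (["goal", "match", "team"], "sports"),
    (["team", "win", "match"], "sports"),
    (["election", "vote", "party"], "politics"),
    (["vote", "party", "win"], "politics"),
]

def train_naive_bayes(docs, alpha=1.0):
    """Estimate log priors and smoothed per-class word log likelihoods."""
    labels = [label for _, label in docs]
    vocab = sorted({tok for tokens, _ in docs for tok in tokens})
    log_prior = {c: math.log(n / len(docs)) for c, n in Counter(labels).items()}
    log_likelihood = {}
    for c in log_prior:
        counts = Counter(tok for tokens, label in docs if label == c for tok in tokens)
        total = sum(counts.values()) + alpha * len(vocab)
        log_likelihood[c] = {w: math.log((counts[w] + alpha) / total) for w in vocab}
    return log_prior, log_likelihood

def predict(tokens, log_prior, log_likelihood):
    """Return the class with the highest log posterior; unseen words are skipped."""
    scores = {
        c: log_prior[c] + sum(log_likelihood[c][t] for t in tokens if t in log_likelihood[c])
        for c in log_prior
    }
    return max(scores, key=scores.get)

log_prior, log_likelihood = train_naive_bayes(train_docs)
toy_prediction = predict(["team", "match", "vote"], log_prior, log_likelihood)
print(toy_prediction)
```

Working in log space avoids numerical underflow when many small probabilities are multiplied together, which matters for long documents.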

For our implementation, we will use the 20 Newsgroups dataset, a collection of approximately 20,000 newsgroup documents partitioned across 20 different newsgroups. This dataset is commonly used for experiments in text applications of machine learning techniques, such as text classification and text clustering.

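First, let’s load the dataset and preprocess the text data. Text preprocessing typically involves several steps: tokenization, removal of stop words, stemming/lemmatization, and vectorization. In the code below, TfidfVectorizer handles tokenization, stop-word removal, and vectorization; it does not stem or lemmatize by default, so that step is not applied here.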

import pandas as pd
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score, precision_score, recall_score, confusion_matrix, classification_report

# Load the dataset
newsgroups = fetch_20newsgroups(subset='all')
data = newsgroups.data
target = newsgroups.target

print(f"Loaded {len(data)} documents.")

# Preprocess the data
vectorizer = TfidfVectorizer(stop_words='english')
X = vectorizer.fit_transform(data)

print(f"Transformed text data into a {X.shape[0]}x{X.shape[1]} matrix.")
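To see what the vectorizer does without downloading the full dataset, the following sketch runs it on two made-up sentences. The sentences and the `toy_` variable names are assumptions for illustration. It shows that common English stop words such as "the" are dropped and that each document becomes one row of a sparse matrix.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical two-document corpus
toy_corpus = [
    "The team won the match.",
    "The party won the election.",
]

toy_vectorizer = TfidfVectorizer(stop_words="english")
toy_matrix = toy_vectorizer.fit_transform(toy_corpus)

# vocabulary_ maps each retained (lowercased) token to its column index
toy_vocab = sorted(toy_vectorizer.vocabulary_)
print(toy_vocab)
print(toy_matrix.shape)  # (number of documents, number of retained tokens)
```

By default the vectorizer also lowercases the text and L2-normalizes each row, so documents of different lengths end up on a comparable scale.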

Next, we split the dataset into training and testing sets. This allows us to train the model on one portion of the data and evaluate its performance on another.

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, target, test_size=0.25, random_state=42)

print(f"Training set has {X_train.shape[0]} samples.")
print(f"Testing set has {X_test.shape[0]} samples.")
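One optional refinement, not used in the code above, is a stratified split, which keeps the class proportions the same in the training and testing sets. The sketch below demonstrates the `stratify` parameter of train_test_split on made-up, imbalanced labels; the `toy_` names and the 32/8 class balance are assumptions for illustration.

```python
from collections import Counter
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced data: 32 documents of class 0, 8 of class 1
toy_docs = [f"doc {i}" for i in range(40)]
toy_labels = [0] * 32 + [1] * 8

docs_train, docs_test, labels_train, labels_test = train_test_split(
    toy_docs, toy_labels, test_size=0.25, random_state=42, stratify=toy_labels
)

# Both subsets keep the original 80% / 20% class ratio
print(Counter(labels_train))
print(Counter(labels_test))
```

For the 20 Newsgroups data the classes are roughly balanced, so stratification matters less there than it would on a skewed dataset.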

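We will use the MultinomialNB classifier from sklearn, which is designed for classification with discrete features, such as word counts. Our features are TF-IDF weights rather than raw counts; the sklearn documentation notes that fractional counts such as TF-IDF may also work in practice.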

# Train the Naïve Bayesian classifier
classifier = MultinomialNB()
classifier.fit(X_train, y_train)

print("Training complete.")
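The walkthrough fits the vectorizer on the full dataset before splitting, so the IDF weights are computed with the test documents included. A possible alternative, sketched here under the assumption that you split the raw text first, is to chain the vectorizer and classifier in an sklearn pipeline so that both are fitted on training documents only. The toy texts, labels, and variable names are made up for illustration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical toy data standing in for raw newsgroup posts
train_texts = [
    "the team won the match",
    "a great goal in the final match",
    "the party won the election",
    "voters backed the party in the vote",
]
train_labels = ["sports", "sports", "politics", "politics"]

# alpha is the additive smoothing strength; 1.0 is the sklearn default
text_clf = make_pipeline(
    TfidfVectorizer(stop_words="english"),
    MultinomialNB(alpha=1.0),
)
text_clf.fit(train_texts, train_labels)

# The pipeline vectorizes new raw text with the training vocabulary, then classifies it
pipeline_prediction = text_clf.predict(["who scored the goal in the match"])
print(pipeline_prediction)
```

With this design, `text_clf.predict` accepts raw strings directly, and words that were never seen during training are simply ignored by the vectorizer.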

After training the model, we use it to make predictions on the test data.

# Make predictions on the test set
y_pred = classifier.predict(X_test)

print("Predictions complete.")
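Besides hard labels, MultinomialNB can also report a probability for each class through `predict_proba`. The following sketch uses a tiny made-up count matrix (the feature columns and `toy_` names are assumptions) to show that the predicted label is simply the class with the highest probability.

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB

# Hypothetical word-count features; columns = ["goal", "match", "vote", "party"]
X_toy = np.array([[2, 1, 0, 0], [1, 2, 0, 0], [0, 0, 2, 1], [0, 0, 1, 2]])
y_toy = np.array([0, 0, 1, 1])

toy_clf = MultinomialNB().fit(X_toy, y_toy)
X_new = np.array([[1, 1, 0, 0], [0, 1, 1, 1]])

toy_proba = toy_clf.predict_proba(X_new)  # shape: (n_samples, n_classes)
toy_pred = toy_clf.classes_[toy_proba.argmax(axis=1)]
print(toy_proba.round(3))
print(toy_pred)
```

Naïve Bayes probabilities tend to be pushed toward 0 or 1 because of the independence assumption, so they are usually more useful for ranking classes than as calibrated confidence values.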

To evaluate the model, we will calculate the accuracy, precision, and recall. Because this is a 20-class problem, precision and recall are computed per class and then averaged, weighted by the number of test documents in each class (average='weighted').

# Calculate accuracy
accuracy = accuracy_score(y_test, y_pred)

# Calculate precision and recall
precision = precision_score(y_test, y_pred, average='weighted')
recall = recall_score(y_test, y_pred, average='weighted')

print(f"Accuracy: {accuracy}")
print(f"Precision: {precision}")
print(f"Recall: {recall}")

# Print a detailed classification report
print("Classification Report:")
print(classification_report(y_test, y_pred, target_names=newsgroups.target_names))

Accuracy: The ratio of correctly predicted instances to the total number of instances.
Precision: For a given class, the ratio of correctly predicted instances of that class to all instances predicted as that class. It measures how trustworthy the positive predictions are.
Recall: For a given class, the ratio of correctly predicted instances of that class to all instances that actually belong to that class. It measures the model’s ability to capture all the positive samples.
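These definitions can be checked by hand from a confusion matrix. The sketch below uses made-up labels for a three-class problem (the arrays and variable names are assumptions for illustration), computes per-class precision and recall from the matrix, and then forms the support-weighted averages that `average='weighted'` reports.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Hypothetical true and predicted labels for a 3-class problem
y_true_toy = np.array([0, 0, 0, 0, 1, 1, 1, 2, 2, 2])
y_pred_toy = np.array([0, 0, 1, 0, 1, 1, 2, 2, 2, 0])

cm = confusion_matrix(y_true_toy, y_pred_toy)  # rows = true class, columns = predicted class
true_positives = np.diag(cm)
per_class_precision = true_positives / cm.sum(axis=0)  # TP / everything predicted as the class
per_class_recall = true_positives / cm.sum(axis=1)     # TP / everything actually in the class
support = cm.sum(axis=1)                               # number of true instances per class

weighted_precision = np.average(per_class_precision, weights=support)
weighted_recall = np.average(per_class_recall, weights=support)
toy_accuracy = true_positives.sum() / cm.sum()

print(cm)
print(weighted_precision, weighted_recall, toy_accuracy)
```

Note that support-weighted recall is algebraically identical to accuracy, so those two numbers will always match; the per-class rows of the classification report are where the differences between classes show up.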

Let’s now put all of the steps together in a single script, with comments explaining each step. This example can serve as a starting point for anyone looking to implement a Naïve Bayesian classifier for document classification.

# Import necessary libraries
import pandas as pd
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score, precision_score, recall_score, confusion_matrix, classification_report

# Step 1: Load the dataset
newsgroups = fetch_20newsgroups(subset='all')
data = newsgroups.data
target = newsgroups.target
print(f"Loaded {len(data)} documents.")

# Step 2: Preprocess the data
# Convert the raw text data into TF-IDF feature vectors
vectorizer = TfidfVectorizer(stop_words='english')
X = vectorizer.fit_transform(data)
print(f"Transformed text data into a {X.shape[0]}x{X.shape[1]} matrix.")

# Step 3: Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, target, test_size=0.25, random_state=42)
print(f"Training set has {X_train.shape[0]} samples.")
print(f"Testing set has {X_test.shape[0]} samples.")

# Step 4: Train the Naïve Bayesian classifier
classifier = MultinomialNB()
classifier.fit(X_train, y_train)
print("Training complete.")

# Step 5: Make predictions on the test set
y_pred = classifier.predict(X_test)
print("Predictions complete.")

# Step 6: Evaluate the model
# Calculate accuracy
accuracy = accuracy_score(y_test, y_pred)

# Calculate precision and recall
precision = precision_score(y_test, y_pred, average='weighted')
recall = recall_score(y_test, y_pred, average='weighted')

print(f"Accuracy: {accuracy}")
print(f"Precision: {precision}")
print(f"Recall: {recall}")

# Print a detailed classification report
print("Classification Report:")
print(classification_report(y_test, y_pred, target_names=newsgroups.target_names))

In this blog post, we have walked through the process of implementing a Naïve Bayesian classifier to classify a set of documents. We started by loading and preprocessing the dataset, followed by training the classifier, making predictions, and evaluating the model’s performance using accuracy, precision, and recall. The same workflow can be applied to other text classification tasks with minimal modifications.

By following the steps outlined, you should be able to implement your own Naïve Bayesian classifier for document classification and evaluate its performance. The Naïve Bayesian model, despite its simplicity, is fast to train, serves as a strong baseline for text classification, and is a valuable tool in the machine learning toolkit.
