Build a Fake News Detector in Python with Machine Learning | by Shradhdha Bhalodia | Apr, 2025

In today’s digital world, the spread of fake news on social media and the internet has become a major concern. Detecting whether a piece of news is real or fake is a powerful application of Natural Language Processing (NLP) and Machine Learning (ML).

In this article, we will build a fake news classifier using a dataset that scikit-learn can load for you, so there is nothing to download by hand (the loader fetches the data on first use and caches it locally). We’ll walk through cleaning text, converting it into a machine-readable format using TF-IDF, training a model, and making predictions.

We’ll simulate fake vs real news classification using the 20 Newsgroups dataset. This dataset contains 20 categories of newsgroup documents. For our problem, we’ll treat certain categories as “real news” and others as “fake or suspicious news”.

🎯 Goal: Classify a news article as Real (1) or Fake (0).

We’ll use:

Python
scikit-learn (for dataset, preprocessing, modeling)
Matplotlib & Seaborn (for visualization)

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import re
import string
from sklearn.datasets import fetch_20newsgroups
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score

We’ll group six categories into two classes for binary classification:

talk.politics.misc, talk.politics.guns, alt.atheism → Fake (0)
sci.space, sci.med, sci.electronics → Real (1)

categories_fake = ['talk.politics.misc', 'talk.politics.guns', 'alt.atheism']
categories_real = ['sci.space', 'sci.med', 'sci.electronics']

# Load the data
fake_data = fetch_20newsgroups(subset='train', categories=categories_fake, remove=('headers', 'footers', 'quotes'))
real_data = fetch_20newsgroups(subset='train', categories=categories_real, remove=('headers', 'footers', 'quotes'))


# Create dataframes
df_fake = pd.DataFrame({'text': fake_data.data, 'label': 0})
df_real = pd.DataFrame({'text': real_data.data, 'label': 1})
# Combine and shuffle
df = pd.concat([df_fake, df_real], ignore_index=True)
df = df.sample(frac=1, random_state=42).reset_index(drop=True)
print(df.head())

def clean_text(text):
    text = text.lower()                       # normalize case
    text = re.sub(r"http\S+", "", text)       # drop URLs
    text = re.sub(r"\d+", "", text)           # drop digits
    text = re.sub(r"[^\w\s]", "", text)       # drop punctuation
    text = re.sub(r"\s+", " ", text).strip()  # collapse whitespace
    return text

df['cleaned_text'] = df['text'].apply(clean_text)
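To see what the cleaning steps do, here is a minimal sketch that runs the same function on a single made-up sentence. The sample string and its URL are assumptions chosen for illustration; they are not part of the dataset.

```python
import re

def clean_text(text):
    """Same cleaning steps as above, repeated so this snippet runs on its own."""
    text = text.lower()
    text = re.sub(r"http\S+", "", text)
    text = re.sub(r"\d+", "", text)
    text = re.sub(r"[^\w\s]", "", text)
    text = re.sub(r"\s+", " ", text).strip()
    return text

# Hypothetical input, not taken from 20 Newsgroups
sample = "BREAKING: NASA launched 3 satellites!!! Read more at https://example.com/news"
cleaned = clean_text(sample)
print(cleaned)  # breaking nasa launched satellites read more at
```

Note that the digit is removed along with the URL and punctuation, so "3 satellites" becomes "satellites". If numbers matter for your use case, you may want to skip that step.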

X = df['cleaned_text']
y = df['label']

# Split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Vectorize with TF-IDF
tfidf = TfidfVectorizer(stop_words='english', max_df=0.7)
X_train_tfidf = tfidf.fit_transform(X_train)
X_test_tfidf = tfidf.transform(X_test)
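The vectorizer is fitted on the training split only and then reused on the test split. The sketch below shows why this matters, using a toy corpus (the sentences and the `toy_` names are assumptions for illustration): words that were never seen during fitting are simply dropped at transform time, and both matrices end up with the same columns.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy corpus (hypothetical sentences, not taken from 20 Newsgroups)
train_docs = [
    "the rocket reached orbit",
    "the senate held a vote",
    "the rocket engine failed",
]
test_docs = ["the rocket vote surprised a martian"]

toy_tfidf = TfidfVectorizer(stop_words="english")
toy_train = toy_tfidf.fit_transform(train_docs)  # learns vocabulary and IDF weights
toy_test = toy_tfidf.transform(test_docs)        # reuses them, learns nothing new

print(sorted(toy_tfidf.vocabulary_))         # words kept after stop-word removal
print(toy_train.shape, toy_test.shape)       # same number of columns in both
print("martian" in toy_tfidf.vocabulary_)    # unseen test word is not a feature
```

Calling `fit_transform` on the test data instead would build a different vocabulary, and the model's learned weights would no longer line up with the columns.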

We’ll use a Logistic Regression model — simple, fast and effective for text classification.

model = LogisticRegression()
model.fit(X_train_tfidf, y_train)

y_pred = model.predict(X_test_tfidf)

print("Accuracy:", accuracy_score(y_test, y_pred))
print("\nClassification Report:\n", classification_report(y_test, y_pred))
# Confusion Matrix
cm = confusion_matrix(y_test, y_pred)
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues', xticklabels=['Fake', 'Real'], yticklabels=['Fake', 'Real'])
plt.xlabel('Predicted')
plt.ylabel('Actual')
plt.title('Confusion Matrix')
plt.show()
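To read the heatmap, it helps to know how its four cells relate to the numbers in the classification report. Here is a small sketch with hand-written labels (the `demo_` arrays are made up for illustration) that unpacks the matrix and recomputes precision and recall for the "Real" class.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Hypothetical labels: 0 = Fake, 1 = Real
demo_true = np.array([0, 0, 0, 0, 1, 1, 1, 1, 1, 1])
demo_pred = np.array([0, 0, 0, 1, 1, 1, 1, 1, 0, 0])

demo_cm = confusion_matrix(demo_true, demo_pred)
# Rows are actual classes, columns are predicted classes
tn, fp, fn, tp = (int(v) for v in demo_cm.ravel())

precision_real = tp / (tp + fp)   # of everything predicted Real, how much was Real
recall_real = tp / (tp + fn)      # of everything actually Real, how much was found
accuracy = (tp + tn) / demo_cm.sum()

print(demo_cm)
print(f"precision={precision_real:.2f} recall={recall_real:.2f} accuracy={accuracy:.2f}")
# precision=0.80 recall=0.67 accuracy=0.70
```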

def predict_news(text):
    text_clean = clean_text(text)             # apply the same cleaning used for training
    text_vec = tfidf.transform([text_clean])  # reuse the fitted vectorizer
    prediction = model.predict(text_vec)[0]
    return "Real News ✅" if prediction == 1 else "Fake News ❌"

# Example
sample_news = "NASA just launched a new satellite to monitor climate change."
print(predict_news(sample_news))

# Example of a fake news headline/text
sample_news = "Aliens have contacted the government and are living in Area 51 secretly."
print(predict_news(sample_news))
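A hard label hides how sure the model is. As a possible extension, the sketch below returns the predicted probability as well, and also reports how many known words the input contained. It is trained on a toy corpus, and `predict_with_confidence` is a hypothetical helper, not part of the article's code.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Toy data (hypothetical): label 1 = science-like, label 0 = politics-like
toy_texts = [
    "rocket launch orbit satellite",
    "satellite orbit telescope",
    "rocket engine telescope",
    "gun law senate debate",
    "senate vote debate",
    "gun vote law",
]
toy_labels = [1, 1, 1, 0, 0, 0]

toy_vec = TfidfVectorizer()
toy_model = LogisticRegression().fit(toy_vec.fit_transform(toy_texts), toy_labels)

def predict_with_confidence(text):
    """Hypothetical helper: return (label, probability, number of known words)."""
    vec = toy_vec.transform([text.lower()])
    proba = toy_model.predict_proba(vec)[0]  # column order follows toy_model.classes_
    best = proba.argmax()
    return int(toy_model.classes_[best]), float(proba[best]), int(vec.nnz)

print(predict_with_confidence("rocket reaches orbit"))
print(predict_with_confidence("aliens live underground"))
```

The second input shares no words with the training vocabulary, so its feature vector is all zeros and the prediction comes from the model's intercept alone. Checking the known-word count is a cheap way to flag inputs the model cannot really judge.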

Accuracy is typically above 85% on the held-out test set
TF-IDF + Logistic Regression worked well for this binary classification
Can be extended to more categories or used with more powerful models

Want to go further?

Try SVM or Naive Bayes
Add N-grams to TF-IDF
Use deep learning (BERT, LSTM)
Deploy with Streamlit or Flask
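The first two suggestions can be tried with very little code. Below is a minimal sketch, again on a toy corpus (texts, labels, and variable names are assumptions), that wraps TF-IDF with unigrams and bigrams together with Naive Bayes and a linear SVM in scikit-learn pipelines.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Toy data (hypothetical): label 1 = science-like, label 0 = politics-like
toy_texts = [
    "rocket launch orbit satellite",
    "satellite orbit telescope",
    "rocket engine telescope",
    "gun law senate debate",
    "senate vote debate",
    "gun vote law",
]
toy_labels = [1, 1, 1, 0, 0, 0]

# ngram_range=(1, 2) adds word pairs such as "senate vote" as extra features
candidates = {
    "naive_bayes": make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), MultinomialNB()),
    "linear_svm": make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), LinearSVC()),
}

results = {}
for name, pipe in candidates.items():
    pipe.fit(toy_texts, toy_labels)
    results[name] = pipe.predict(["satellite orbit", "senate debate"]).tolist()
    print(name, results[name])
```

A pipeline keeps the vectorizer and the model together, so raw text goes in and labels come out. On the real dataset you would fit each pipeline on `X_train` and compare them on `X_test`.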

This project shows how you can build a fake news classifier using just inbuilt tools and data from sklearn. It’s a great starting point for NLP projects that solve real-world problems.
