Build a Fake News Detector in Python with Machine Learning | by Shradhdha Bhalodia | Apr, 2025


In today’s digital world, the spread of fake news on social media and the internet has become a major concern. Detecting whether a piece of news is real or fake is a powerful application of Natural Language Processing (NLP) and Machine Learning (ML).

In this article, we will build a fake news classifier using a dataset that scikit-learn can load for you, so there is no manual download step (the loader fetches the data on first use and caches it locally). We’ll walk through cleaning text, converting it into a machine-readable format using TF-IDF, training a model, and making predictions.

We’ll simulate fake vs real news classification using the 20 Newsgroups dataset. This dataset contains 20 categories of newsgroup documents. For our problem, we’ll treat certain categories as “real news” and others as “fake or suspicious news”.

🎯 Goal: Classify a news article as Real (1) or Fake (0).

We’ll use:

  • Python
  • scikit-learn (for dataset, preprocessing, modeling)
  • pandas (for holding the text and labels in a DataFrame)
  • Matplotlib & Seaborn (for visualization)

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import re
import string

from sklearn.datasets import fetch_20newsgroups
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score

We’ll build the two classes for binary classification from two groups of three newsgroups each:

  • talk.politics.misc, talk.politics.guns, alt.atheism → Fake (0)
  • sci.space, sci.med, sci.electronics → Real (1)

categories_fake = ['talk.politics.misc', 'talk.politics.guns', 'alt.atheism']
categories_real = ['sci.space', 'sci.med', 'sci.electronics']

# Load the data
fake_data = fetch_20newsgroups(subset='train', categories=categories_fake, remove=('headers', 'footers', 'quotes'))
real_data = fetch_20newsgroups(subset='train', categories=categories_real, remove=('headers', 'footers', 'quotes'))

# Create dataframes
df_fake = pd.DataFrame({'text': fake_data.data, 'label': 0})
df_real = pd.DataFrame({'text': real_data.data, 'label': 1})

# Combine and shuffle
df = pd.concat([df_fake, df_real], ignore_index=True)
df = df.sample(frac=1, random_state=42).reset_index(drop=True)

print(df.head())

def clean_text(text):
    text = text.lower()                        # normalize case
    text = re.sub(r"http\S+", "", text)        # drop URLs
    text = re.sub(r"\d+", "", text)            # drop digits
    text = re.sub(r"[^\w\s]", "", text)        # drop punctuation
    text = re.sub(r"\s+", " ", text).strip()   # collapse runs of whitespace
    return text

df['cleaned_text'] = df['text'].apply(clean_text)
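
To see what each substitution in clean_text does, here is a small self-contained check on a made-up string. The sample sentence is an illustration and does not come from the dataset, and the function is repeated only so the snippet runs on its own.

```python
import re

def clean_text(text):
    """Same cleaning steps as in the article, repeated so this snippet is standalone."""
    text = text.lower()
    text = re.sub(r"http\S+", "", text)
    text = re.sub(r"\d+", "", text)
    text = re.sub(r"[^\w\s]", "", text)
    text = re.sub(r"\s+", " ", text).strip()
    return text

# Hypothetical input chosen to exercise every rule: a URL, digits,
# punctuation, mixed case, and repeated spaces.
sample = "Visit https://example.com NOW!!! Call 555-1234, it's   FREE."
cleaned = clean_text(sample)
print(cleaned)  # visit now call its free
```

Note that the punctuation rule also removes apostrophes, so "it's" becomes "its". That is usually harmless for a bag-of-words model, but it is worth knowing when you inspect the vocabulary later.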

X = df['cleaned_text']
y = df['label']

# Split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Vectorize with TF-IDF
tfidf = TfidfVectorizer(stop_words='english', max_df=0.7)
X_train_tfidf = tfidf.fit_transform(X_train)
X_test_tfidf = tfidf.transform(X_test)
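
For intuition about what TfidfVectorizer computes, the sketch below reimplements the weighting in plain Python. It assumes scikit-learn's default settings: raw term counts, smoothed IDF of the form ln((1 + n) / (1 + df)) + 1, and L2 normalization of each document vector. The helper name tfidf_vectors and the three toy documents are hypothetical, and the sketch skips stop-word removal and max_df filtering.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Return (vocab, idf, vectors) for whitespace-tokenized documents."""
    tokenized = [doc.split() for doc in docs]
    vocab = sorted({tok for toks in tokenized for tok in toks})
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term appears.
    df = Counter(tok for toks in tokenized for tok in set(toks))
    # Smoothed IDF: rare terms get a larger weight than common ones.
    idf = {t: math.log((1 + n_docs) / (1 + df[t])) + 1.0 for t in vocab}
    vectors = []
    for toks in tokenized:
        counts = Counter(toks)
        raw = [counts[t] * idf[t] for t in vocab]
        norm = math.sqrt(sum(v * v for v in raw)) or 1.0
        vectors.append([v / norm for v in raw])  # L2-normalize each row
    return vocab, idf, vectors

# Hypothetical toy corpus.
vocab, idf, vectors = tfidf_vectors(
    ["space probe launch", "gun law debate", "space law"]
)
print(round(idf["space"], 3))  # 1.288 (appears in 2 of 3 documents)
print(round(idf["gun"], 3))    # 1.693 (appears in 1 of 3 documents)
```

This also shows why the article fits the vectorizer on X_train only: the IDF weights depend on document frequencies, and computing them on the test set would leak information about it into training.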

We’ll use a Logistic Regression model — simple, fast and effective for text classification.

model = LogisticRegression()
model.fit(X_train_tfidf, y_train)
y_pred = model.predict(X_test_tfidf)

print("Accuracy:", accuracy_score(y_test, y_pred))
print("\nClassification Report:\n", classification_report(y_test, y_pred))

# Confusion Matrix
cm = confusion_matrix(y_test, y_pred)
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues', xticklabels=['Fake', 'Real'], yticklabels=['Fake', 'Real'])
plt.xlabel('Predicted')
plt.ylabel('Actual')
plt.title('Confusion Matrix')
plt.show()
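
The numbers in the classification report can be derived directly from the confusion matrix. The sketch below uses hypothetical counts (not results from this model) in the same layout that confusion_matrix returns: rows are actual labels, columns are predicted labels, both ordered Fake (0), then Real (1).

```python
# Hypothetical counts, for illustration only.
cm = [[50, 10],   # actual Fake: 50 predicted Fake, 10 predicted Real
      [5, 35]]    # actual Real:  5 predicted Fake, 35 predicted Real

# Treat Real (1) as the positive class.
tn, fp = cm[0]
fn, tp = cm[1]

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision_real = tp / (tp + fp)   # of everything predicted Real, how much was Real
recall_real = tp / (tp + fn)      # of everything actually Real, how much was found
f1_real = 2 * precision_real * recall_real / (precision_real + recall_real)

print(f"accuracy={accuracy:.3f} precision={precision_real:.3f} "
      f"recall={recall_real:.3f} f1={f1_real:.3f}")
# accuracy=0.850 precision=0.778 recall=0.875 f1=0.824
```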

def predict_news(text):
    text_clean = clean_text(text)              # same cleaning as the training data
    text_vec = tfidf.transform([text_clean])   # reuse the fitted vectorizer
    prediction = model.predict(text_vec)[0]
    return "Real News ✅" if prediction == 1 else "Fake News ❌"

# Example
sample_news = "NASA just launched a new satellite to monitor climate change."
print(predict_news(sample_news))

# Example of a fake news headline/text
sample_news = "Aliens have contacted the government and are living in Area 51 secretly."

print(predict_news(sample_news))
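
A hard label hides how sure the model is. One way to expose that, sketched here on a hypothetical toy corpus, is to wrap the vectorizer and classifier in a scikit-learn pipeline and call predict_proba. The function name predict_with_confidence is an assumption made for this example, and the toy pipeline skips the clean_text step for brevity.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Hypothetical toy corpus standing in for the newsgroup posts.
docs = [
    "rocket reached orbit", "orbit telescope images", "rocket engine test",  # Real (1)
    "gun law debate", "law senate vote", "gun court ruling",                 # Fake (0)
]
labels = [1, 1, 1, 0, 0, 0]

# The pipeline vectorizes and classifies in one call, so raw text goes straight in.
pipeline = make_pipeline(TfidfVectorizer(), LogisticRegression())
pipeline.fit(docs, labels)

def predict_with_confidence(text):
    """Return (label, probability of that label) for one piece of text."""
    proba = pipeline.predict_proba([text])[0]   # columns follow pipeline.classes_
    best = proba.argmax()
    label = "Real" if pipeline.classes_[best] == 1 else "Fake"
    return label, float(proba[best])

print(predict_with_confidence("new rocket reaches orbit"))
print(predict_with_confidence("senate debate on gun law"))
```

A probability close to 0.5 is a signal that the text contains few words the model has seen, which is common for short headlines.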

  • Accuracy on the held-out test set is typically above 85%
  • TF-IDF + Logistic Regression works well for this binary classification task
  • The approach can be extended to more categories or paired with more powerful models

Want to go further?

  • Try SVM or Naive Bayes
  • Add N-grams to TF-IDF
  • Use deep learning (BERT, LSTM)
  • Deploy with Streamlit or Flask

This project shows how you can build a fake news classifier using just inbuilt tools and data from sklearn. It’s a great starting point for NLP projects that solve real-world problems.
