Fake News Detection using Machine Learning
by Camilo Gonçalves | Jul 2024


[Cover image generated with DALL-E]

“Data is the new oil.” Since The Economist’s famous story, this phrase has become a common refrain. Setting aside all discussion and criticism about this metaphor, the importance of data as a resource in the digital era is widely acknowledged. Recognizing this statement also implies acknowledging both sides of the coin. Data can be as beneficial as it is dangerous, and in the wrong hands, data and its attendant distortions can become a weapon.

In this project, we’ll develop a Machine Learning model to identify fake news. We’ll use Kaggle’s REAL and FAKE news dataset, which contains a small collection of news articles labeled as REAL or FAKE. This project’s goal is to provide a basic understanding of how to process real text data and use it to solve important current problems.

Natural Language Processing (NLP) combines multiple technologies to enable machine “understanding” of human language on both objective and subjective levels.

I chose NLTK as our main NLP framework. Quoting the framework’s documentation site:

NLTK is a leading platform for building Python programs to work with human language data. It provides easy-to-use interfaces to over 50 corpora and lexical resources such as WordNet, along with a suite of text processing libraries for classification, tokenization, stemming, tagging, parsing, and semantic reasoning, wrappers for industrial-strength NLP libraries, and an active discussion forum.

The following section provides a basic understanding of the NLP concepts that we’ll be using in this project.

A corpus refers to a body of text, which can be in any language. A collection of such bodies of text is called corpora (the plural of corpus).

Tokenization is the process of breaking down text into individual tokens. Depending on the use case, there are multiple types of tokens, such as:

  • Word Tokenization: The text is broken into words.
  • Sentence Tokenization: The text is broken into individual sentences.
  • Subword Tokenization: Words in the text are broken down into smaller units or subwords.
  • Character Tokenization: The text is broken into individual characters.
[Figure: types of tokenization (image by Aysel Aydin)]
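
As a quick illustration, here’s how NLTK’s word and sentence tokenizers split a short sample (a minimal sketch; it assumes the punkt models, which we download later in this project):

from nltk.tokenize import sent_tokenize, word_tokenize

sample = "Fake news spreads fast. Can we detect it automatically?"
print(sent_tokenize(sample))
# ['Fake news spreads fast.', 'Can we detect it automatically?']
print(word_tokenize(sample))
# ['Fake', 'news', 'spreads', 'fast', '.', 'Can', 'we', 'detect', 'it', 'automatically', '?']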

Normalization is the process of transforming text into a single canonical form, reducing its randomness and bringing it closer to a predefined standard.
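
A very common normalization step, and the first one applied in this project’s preprocessing function further below, is lowercasing the text and replacing non-alphabetic characters:

import re

text = "Breaking: 3 new emails LEAKED!"
# Replace anything that isn't a letter with a space, then lowercase.
print(re.sub("[^a-zA-Z]", " ", text).lower())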

Stemming is the process of reducing words to their root form. For example:
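
Here’s a quick sketch using NLTK’s PorterStemmer (the same stemmer this project imports later), with the expected outputs shown as comments:

from nltk.stem.porter import PorterStemmer

porter = PorterStemmer()
print([porter.stem(word) for word in ["ponies", "ties", "caresses", "running"]])
# ['poni', 'ti', 'caress', 'run']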

Note that words have been reduced to something that isn’t an actual word. This is because stemming is a heuristic process that achieves normalization by simply stripping the ends of words. It’s fast and generally effective, but it can introduce errors, such as reducing words more than necessary (over-stemming) or less than necessary (under-stemming).

Lemmatization reduces words to their base form, properly handling inflected words to ensure that the root word is an actual word. For example:
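
Here’s a quick sketch using NLTK’s WordNetLemmatizer (not used in this project, and it requires the wordnet corpus, so treat it as an illustrative aside):

from nltk.stem import WordNetLemmatizer

# nltk.download("wordnet")  # required corpus, if not already available
lemmatizer = WordNetLemmatizer()
print(lemmatizer.lemmatize("ponies"))           # 'pony'
print(lemmatizer.lemmatize("was", pos="v"))     # 'be'
print(lemmatizer.lemmatize("better", pos="a"))  # 'good'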

It’s generally a more sophisticated process than stemming, and it can be further improved if word contexts are provided. This can be achieved through a process called parts-of-speech (POS) tagging.
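
For instance, NLTK’s pos_tag can provide that context (a sketch; it assumes the averaged_perceptron_tagger model, which this project doesn’t download):

import nltk
from nltk.tokenize import word_tokenize

# nltk.download("averaged_perceptron_tagger")  # tagger model, if not already available
print(nltk.pos_tag(word_tokenize("He was running")))
# e.g. [('He', 'PRP'), ('was', 'VBD'), ('running', 'VBG')]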

According to the Cambridge Dictionary, a lexicon is a list of all the words used in a particular language or subject (i.e., a dictionary). In the context of NLP, a lexicon is a collection of words annotated with specific features (e.g., part-of-speech tags, sentiment scores). It serves as a reference for interpreting human language by providing the meanings, grammatical properties, and other special information of those words.

Stop words are commonly used words and tokens in a language that, without context (i.e., by themselves), carry very little useful information, such as punctuation, articles, and pronouns. Depending on the use case, they can be pure noise to be removed entirely from the analysis, or so important that their presence is required. This project shows both cases.
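
NLTK ships ready-made stop word lists for several languages; here’s a quick peek (assuming the stopwords corpus, which we download later in this project):

from nltk.corpus import stopwords

print(stopwords.words("english")[:8])
# e.g. ['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves']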

Bag of Words (BoW) is one of the most well-known text feature extraction techniques. Its output is a table with counts of word occurrences in a set of documents.

[Figure: Bag of Words example (image by Aysel Aydin)]
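
Here’s a minimal sketch of BoW using scikit-learn’s CountVectorizer (this project uses TF-IDF instead, but the idea is the same):

from sklearn.feature_extraction.text import CountVectorizer

docs = ["the news is fake", "the news is real news"]
bow = CountVectorizer()
counts = bow.fit_transform(docs)
print(bow.get_feature_names_out())  # ['fake' 'is' 'news' 'real' 'the']
print(counts.toarray())             # [[1 1 1 0 1]
                                    #  [0 1 2 1 1]]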

Term Frequency – Inverse Document Frequency (TF-IDF) is a text feature extraction technique, similar to BoW but, instead of word occurrences, it results in a table with word importances. Word importance is a measure of a word’s frequency in a specific document with respect to the word’s inverse frequency in all documents. This can be mathematically defined as follows:

tf-idfₓ,ᵧ = tfₓ,ᵧ × log(N / dfₓ)

where:

  • x is a specific word;
  • y is a specific document;
  • tfₓ,ᵧ is the frequency of x in y;
  • dfₓ is the number of documents containing x;
  • N is the total number of documents.

Basically, a word is important for a text document if it occurs frequently in that specific document but rarely in all others.
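
As a tiny worked example of the formula above (assuming a natural logarithm; scikit-learn’s TfidfVectorizer, which we’ll use later, applies a smoothed variant, so its exact values differ):

import math

tf = 3    # the word appears 3 times in this document
df_x = 2  # ... and appears in 2 of the documents
N = 10    # ... out of 10 documents in the corpus
print(tf * math.log(N / df_x))  # ≈ 4.83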

This project will use some of the most well-known libraries for Scientific Computing, Data Visualization, Natural Language Processing, and Machine Learning available for Python:

import os
import re
import time

import joblib
import lightgbm as lgb
import matplotlib.pyplot as plt
import nltk
import pandas as pd
import seaborn as sns
import xgboost as xgb
from nltk.corpus import stopwords
from nltk.sentiment import SentimentIntensityAnalyzer
from nltk.stem.porter import PorterStemmer
from nltk.tokenize import word_tokenize
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import LabelEncoder
from sklearn.svm import SVC
from wordcloud import WordCloud

Let’s begin by loading the data using Pandas. For simplicity’s sake, the code below executes the following operations:

  • Reads the data from the .csv file located at data/news.csv.
  • Shuffles the DataFrame, a required step for the predictive model that we’ll build.
  • Resets the index of the newly shuffled data.
  • Drops the unwanted Unnamed: 0 column.
  • Prints the DataFrame’s dimensions.
  • Displays the first 5 rows of data.

df = (
    pd.read_csv("data/news.csv")
    .sample(frac=1)
    .reset_index(drop=True)
    .drop("Unnamed: 0", axis="columns")
)

print(f"Dataset shape: {df.shape}")
df.head()

The dataset itself contains 6,335 entries and 3 columns:

  • title: The article’s title.
  • text: The body text of the article.
  • label: A binary label indicating whether the article is REAL or FAKE.

All text is in English. Unfortunately, the dataset does not explicitly provide the dates, but it seems that the majority of the news articles are related to the context of the US presidential elections.

Now, before continuing with the analysis, we’ll download the NLTK resources we’ll need: the Punkt tokenizer models, the stop word lists, and the VADER lexicon. Thanks to the NLTK package, this process is very straightforward.

nltk.download("punkt", download_dir="data") # punctuation
nltk.download("stopwords", download_dir="data") # stop words
nltk.download("vader_lexicon", download_dir="data") # VADER lexicon

# NLTK seem to not unzip the Vader lexicon automatically =[
if not os.path.isfile("data/sentiment/vader_lexicon/vader_lexicon.txt"):
!unzip data/sentiment/vader_lexicon.zip -d data/sentiment

nltk.data.path.append("data")

Our analysis will be driven by the following base questions:

  • How balanced is the data with respect to the news type (fake or real)?
  • How does fake news data differ from real news data?
      • News text size
      • Most common words
      • Overall sentiment

First of all, let’s check the proportion of fake to real news articles. A count plot of the label occurrences can answer that.
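
A minimal version of that plot with seaborn (the exact snippet behind the original figure isn’t shown here) could look like this:

plt.figure(figsize=(6, 4))
sns.countplot(x="label", data=df)
plt.title("Count of REAL and FAKE news articles")
plt.show()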

Real and fake news have almost the same proportions. Good. We don’t have to worry about data imbalance when we reach the Machine Learning model development.

Now let’s check the characteristics of the news data.

The news data itself only has two features: the title and the text body of each news item. For the analysis of text size and word occurrence, the presence of stop words can introduce too much noise in the measurements, so we’ll strip them off. Also, we’ll use the Porter Stemmer to stem the words and give them a standard form.

stop_words = set(stopwords.words("english"))  # we're explicitly loading the stop words for English language.

porter = PorterStemmer()

def preprocess_text(text):
    # Apply a regex substitution to keep alphabetic tokens only, removing everything else.
    text = re.sub("[^a-zA-Z]", " ", text).lower()
    # The default NLTK word tokenizer splits text on whitespace and punctuation.
    word_tokens = word_tokenize(text)
    # Drop stop words and stem the remaining tokens.
    filtered_sequence = [
        porter.stem(word) for word in word_tokens if word not in stop_words
    ]
    # Join the remaining tokens together again.
    return " ".join(filtered_sequence)

# We'll apply our analysis to both title and text body of the news.
df["title_clean"] = df["title"].apply(preprocess_text)
df["text_clean"] = df["text"].apply(preprocess_text)
df.head()

As we can see, we store the cleaned titles and text bodies in additional feature columns, so we don’t lose the original ones. We’ll need them to apply the sentiment analysis a bit later.

The size analysis consists of just checking the distributions of the pure text lengths of each news type.

df["title_length"] = df["title_clean"].str.len()
df["text_length"] = df["text_clean"].str.len()

plt.figure(figsize=(12, 7))

plt.subplot(2, 2, 1)
sns.boxplot(y="title_length", data=df, hue="label", hue_order=["REAL", "FAKE"])
plt.title("Boxplot of title length by REAL or FAKE titles")

plt.subplot(2, 2, 3)
sns.boxplot(y="text_length", data=df, hue="label", hue_order=["REAL", "FAKE"])
plt.title("Boxplot of text length by REAL or FAKE news")

plt.subplot(2, 2, 2)
sns.histplot(x="title_length", data=df, hue="label", hue_order=["REAL", "FAKE"])
plt.title("Histogram of title length by REAL or FAKE titles")

plt.subplot(2, 2, 4)
sns.histplot(x="text_length", data=df, hue="label", hue_order=["REAL", "FAKE"])
plt.title("Histogram of text length by REAL or FAKE news")

plt.tight_layout()
plt.show()

  • The presence of outliers in the boxplots indicates that the dataset contains articles with unusually long titles and text lengths for both real and fake news.
  • The boxplot shows that real news titles tend to be shorter than fake news titles. The histogram of title lengths further supports this conclusion by showing that the distribution of real news peaks at a lower character count and decreases more sharply than the distribution of fake news titles.
  • In contrast, real news articles tend to have a wider range of text lengths, although the median text length of both news types is quite similar. Again, the histogram supports the boxplot by showing that fake news articles cluster at shorter text lengths, towards the left side of the plot, in contrast to the longer right tail displayed by the distribution of real news articles.

Let’s plot the word clouds for titles and text for both types of news.

plt.figure(figsize=(12, 11))

for idx, news_type in enumerate(("FAKE", "REAL")):
    titles = " ".join(df[df["label"] == news_type]["title_clean"])
    news_wordcloud = WordCloud(
        width=800, height=800, background_color="white"
    ).generate(titles)
    plt.subplot(2, 2, idx + 1)
    plt.imshow(news_wordcloud, interpolation="bilinear")
    plt.title(f"Word Cloud for {news_type} News Titles")
    plt.axis("off")

for idx, news_type in enumerate(("FAKE", "REAL")):
    texts = " ".join(df[df["label"] == news_type]["text_clean"])
    news_wordcloud = WordCloud(
        width=800, height=800, background_color="white"
    ).generate(texts)
    plt.subplot(2, 2, idx + 3)
    plt.imshow(news_wordcloud, interpolation="bilinear")
    plt.title(f"Word Cloud for {news_type} News Texts")
    plt.axis("off")

plt.tight_layout()
plt.show()

  • Certain names and terms such as “Trump,” “Clinton,” and “Hillary” are prominently featured in both fake and real news, suggesting that political figures are common subjects in news articles regardless of their veracity.
  • Fake titles and text seem to feature sensational or emotionally charged words (e.g., lie, attack, war, power), as well as controversial topics (e.g., video, email, report), indicating a tendency to use more provocative language that presents speculation or opinion as fact.
  • The real news, in contrast, while still featuring some of the same political names, contains a notable presence of more diverse and policy-oriented terms (e.g., government, house, president, country, debate), which might reflect a broader coverage of topics and a focus on governance and national issues.

In addition to the above, the choice of words can also reflect the sentiment and tone of the articles, as we’ll see next.

For the sentiment analysis, we’ll use the raw (unfiltered) text. Not only the choice of words, but also punctuation and even letter case might indicate the tone and sentiment of the expression.

For the sentiment classification itself, we’ll use NLTK’s SentimentIntensityAnalyzer, which in turn uses the VADER lexicon to give a floating point score representing sentiment polarity for a specific text. A score closer to -1.0 represents a strong negative sentiment, while a score closer to 1.0 represents a strong positive sentiment.

For our analysis, the raw float score itself won’t be important. Instead, based on this score, we’ll classify the sentiment polarity as Strongly Negative, Negative, Neutral, Positive, or Strongly Positive, in increasing order.

sia = SentimentIntensityAnalyzer(
    lexicon_file="data/sentiment/vader_lexicon/vader_lexicon.txt"
)

def get_sentiment_score(text):
    polarity_score = sia.polarity_scores(text)["compound"]
    if polarity_score <= -0.6:
        return "Strongly Negative"
    elif polarity_score <= -0.2:
        return "Negative"
    elif polarity_score <= 0.2:
        return "Neutral"
    elif polarity_score <= 0.6:
        return "Positive"
    else:
        return "Strongly Positive"

df["title_sentiment"] = df["title"].apply(get_sentiment_score)
df["text_sentiment"] = df["text"].apply(get_sentiment_score)

df.head()

With this, we can count the occurrences of each sentiment polarity class in article titles and text.

order = ["Strongly Negative", "Negative", "Neutral", "Positive", "Strongly Positive"]

plt.figure(figsize=(12, 5))
plt.subplot(1, 2, 1)
ax1 = sns.countplot(x="title_sentiment", order=order, data=df, hue="label")
ax1.set_xticks(ax1.get_xticks())
ax1.set_xticklabels(ax1.get_xticklabels(), rotation=-45, ha="left")
plt.title("VADER Title Sentiment Distribution by News Type")
plt.subplot(1, 2, 2)
ax2 = sns.countplot(x="text_sentiment", order=order, data=df, hue="label")
ax2.set_xticks(ax2.get_xticks())
ax2.set_xticklabels(ax2.get_xticklabels(), rotation=-45, ha="left")
plt.title("VADER Text Sentiment Distribution by News Type")

plt.tight_layout()
plt.show();

  • Real news titles have a higher count of Neutral sentiment compared to fake news titles, suggesting that real news tends to use a more measured and objective tone in titles.
  • Fake news titles show a broader distribution across different sentiments, with notably higher counts in Negative and Strongly Negative sentiments, indicating a tendency towards more emotionally charged language.
  • Real news text bodies show a significant lean towards Neutral, reinforcing the idea that the language used is more factual.
  • Fake news has a much higher presence in Strongly Negative sentiment than real news, which aligns with the idea that fake news may use language meant to incite strong emotional reactions, urgency, or controversy.
  • Both real and fake news have smaller counts for Positive and Strongly Positive sentiments in their titles and text bodies, suggesting that positive sentiment is less commonly used in news regardless of its veracity.

For the Machine Learning model, we’ll take a simple approach: we’ll extract TF-IDF vectors from the news articles’ texts and use them as features. In my tests, I also tried other features, such as TF-IDF vectors from the titles and the sentiment polarity of both texts and titles, but they didn’t yield any substantial improvement in the model’s accuracy.

We’ll choose the model itself from a list of predefined models and parameters. For each one, we’ll run 5-fold cross-validation on the training set. Model evaluation will be based on the average cross-validation accuracy, the accuracy on the test set, the inference time, and the model size.

20% of the data will be set aside as the test set, while the remaining 80% will be the training set. The split will be stratified: the proportions of REAL and FAKE news will be maintained in both sets.

# We'll use the already cleaned "text" column.
X = df["text_clean"]
y = df["label"]

# We'll encode the target labels as integers right away, before the split, since
# this step neither depends on nor affects the models' training.
y_encoder = LabelEncoder()
y = y_encoder.fit_transform(y)

# The "stratify" parameter indicates which variable to use as reference for
# maintaining the class proportions (our labels, in this case).
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)
print(X_train.shape, y_train.shape, X_test.shape, y_test.shape)

We’ll use Scikit-Learn’s TfidfVectorizer class to extract the TF-IDF vectors.

data_transformer = TfidfVectorizer(stop_words="english")

We’ll evaluate the following list of models:

  • Logistic Regression
  • Support Vector Machines
  • Random Forest Classifier
  • Multinomial Naive Bayes
  • XGBoost Classifier
  • LightGBM Classifier

Most of the parameters are left at their default values.

models = {
    # max_iter is set to avoid early termination of the algorithm.
    "Logistic Regression": LogisticRegression(max_iter=1000),
    # The linear kernel will avoid overfitting.
    "SVM": SVC(kernel="linear", probability=True),
    "Random Forest": RandomForestClassifier(),
    "MultinomialNB": MultinomialNB(),
    # use_label_encoder was deprecated, so we set it to False to avoid warnings.
    "XGBoost": xgb.XGBClassifier(use_label_encoder=False, eval_metric="logloss"),
    "LightGBM": lgb.LGBMClassifier(),
}

results_entries = []

models_dir = os.path.join("data", "models")
if not os.path.isdir(models_dir):
    os.makedirs(models_dir)

for name, model in models.items():
    print(f"Evaluating model {name}")
    pipeline = make_pipeline(data_transformer, model)

    # 5-fold cross-validation on the training set.
    scores = cross_val_score(pipeline, X_train, y_train, cv=5, scoring="accuracy")

    start_train_time = time.time()
    pipeline.fit(X_train, y_train)
    train_time = time.time() - start_train_time

    start_inference_time = time.time()
    y_pred = pipeline.predict(X_test)
    inference_time = time.time() - start_inference_time

    test_accuracy = accuracy_score(y_test, y_pred)

    # Persist the fitted pipeline with joblib and measure its size on disk.
    model_file = f'{name.replace(" ", "_").lower()}_model.joblib'
    model_path = os.path.join(models_dir, model_file)
    joblib.dump(pipeline, model_path)
    model_size = os.path.getsize(model_path)

    results_entries.append(
        {
            "Model": name,
            "Avg Accuracy": scores.mean(),
            "Avg Std": scores.std(),
            "Test Accuracy": test_accuracy,
            "Train Time": train_time,
            "Inference Time": inference_time,
            "Model Size KB": model_size / 1024,
        }
    )

results = pd.DataFrame(results_entries)

results.sort_values(["Avg Accuracy", "Test Accuracy"], ascending=False)

  • SVM is likely the best performer in terms of cross-validation average accuracy and test accuracy, but this result comes at the cost of significantly longer training and inference times and larger model size compared to other models.
  • LightGBM and XGBoost offer a good compromise between high accuracy and efficiency in terms of both training/inference time and model size.
  • Between LightGBM and XGBoost, the former shows slightly better performance and efficiency, along with more stable results, as indicated by its lower standard deviation.

Given the points above, and keeping in mind the trade-off between accuracy and efficiency, we’ll proceed with the LGBMClassifier for the rest of our analysis.

Now that we have our chosen model, it’s time to train it on the training set and analyze its overall performance on the test set.

model = make_pipeline(data_transformer, lgb.LGBMClassifier(random_state=42))
model.fit(X_train, y_train)

We’ll create a dataframe containing the true and predicted labels for each news article. We’ll also include the cleaned article texts.

inference_df = pd.DataFrame(
    data={
        "news_text": X_test,
        "label": y_encoder.inverse_transform(y_test),
        "prediction": y_encoder.inverse_transform(model.predict(X_test)),
    }
)
inference_df

With everything set, our analysis will be driven by the following questions:

  • How often, and in what ways, do our model’s predictions fail?
  • Are the text size characteristics of the articles maintained with respect to their corresponding news types?
  • Are the common words of the articles maintained with respect to their corresponding news types?
  • Are the text sentiment characteristics of the articles maintained with respect to their corresponding news types?

Let’s plot the model’s confusion matrix.

conf_matrix = confusion_matrix(inference_df["label"], inference_df["prediction"])
plt.figure(figsize=(7, 7))
ax = sns.heatmap(conf_matrix, annot=True, cbar=False, cmap="plasma", fmt="g")
ax.set_xlabel("prediction")
ax.set_xticklabels(["FAKE", "REAL"])
ax.set_ylabel("label")
ax.set_yticklabels(["FAKE", "REAL"])
plt.title(f"Confusion matrix for the {model.steps[-1][-1].__class__.__name__} model");

Let’s also check the model’s precision, recall, and F1 score.

print(classification_report(inference_df["label"], inference_df["prediction"]))
  • These metrics indicate that the model is performing well in detecting real news, with a strong balance between precision and recall.
  • The high precision suggests that there are relatively few false alarms, and the high recall means that the model is good at catching most of the real news.
  • The F1 score, being close to the precision and recall values, shows that the model doesn’t significantly favor one measure over the other, which is often desirable in a balanced classification task.

Now we’ll check the general characteristics of the errors, separating them into a new dataframe.

errors = inference_df[inference_df["label"] != inference_df["prediction"]].copy()

We’ll also retrieve the sentiment polarity of each article.

errors["text_clean"] = errors["news_text"].apply(preprocess_text)
errors["text_sentiment"] = errors["news_text"].apply(get_sentiment_score)
errors.head()

Let’s plot the boxplot and histogram of the text size distribution of the prediction errors.

errors["text_length"] = errors["text_clean"].str.len()

plt.figure(figsize=(12, 5))

plt.subplot(1, 2, 1)
sns.boxplot(y="text_length", data=errors, hue="label", hue_order=["REAL", "FAKE"])
plt.title("Boxplot of text length by REAL or FAKE news (Errors)")

plt.subplot(1, 2, 2)
sns.histplot(x="text_length", data=errors, hue="label", hue_order=["REAL", "FAKE"])
plt.title("Histogram of text length by REAL or FAKE news (Errors)")

plt.tight_layout()
plt.show()

  • Interestingly, one of our earlier observations about the difference between REAL and FAKE news doesn’t hold here: in these samples, FAKE news articles tend to have longer texts than REAL news.
  • Both distributions still contain outliers, with fake news having a more noticeable outlier.

Let’s plot the word clouds for the errors.

plt.figure(figsize=(13, 6))

for idx, news_type in enumerate(("FAKE", "REAL")):
    texts = " ".join(errors[errors["label"] == news_type]["text_clean"])
    news_wordcloud = WordCloud(
        width=800, height=800, background_color="white"
    ).generate(texts)
    # Only the text bodies are plotted here, so a 1x2 grid is enough.
    plt.subplot(1, 2, idx + 1)
    plt.imshow(news_wordcloud, interpolation="bilinear")
    plt.title(f"Word Cloud for {news_type} News Texts (Errors)")
    plt.axis("off")

plt.tight_layout()
plt.show()

The word clouds for both fake and real news errors seem to share some common terms with the overall data clouds. However, some terms might appear more or less prominently compared to the overall dataset.

Finally, let’s plot the sentiment counts in the errors.

order = ["Strongly Negative", "Negative", "Neutral", "Positive", "Strongly Positive"]

plt.figure(figsize=(7, 5))
ax = sns.countplot(x="text_sentiment", order=order, data=errors, hue="label")
ax.set_xticks(ax.get_xticks())
ax.set_xticklabels(ax.get_xticklabels(), rotation=-45, ha="left")
plt.title("VADER Text Sentiment Distribution by News Type (Errors)")

plt.tight_layout()
plt.show();

  • There’s a significant number of errors in the “Strongly Positive” sentiment class, especially for real news. This suggests that the model struggles with correctly classifying real news articles that contain strongly positive sentiment.
  • The counts of errors across sentiment classes are relatively small compared to the overall distribution, which indicates that the model performs well in general. However, the errors that occur are disproportionately in the extreme sentiment classes.

In this project, we addressed the problem of detecting fake news.

  • We first gave a brief review of the most common text data concepts and processing techniques.
  • Then we proceeded to analyze the main features of the text data, such as text size, common words, and sentiment polarity. We also applied some text processing techniques to extract some of these features.
  • We applied the TF-IDF vector extraction technique to create the feature matrix and trained a list of machine learning models using it to detect fake news.
  • We also analyzed the prediction errors of the final model, comparing them with the analysis of the whole data.

Many of the concepts and figures presented here were inspired by Aysel Aydin’s posts on Natural Language Processing.

The code for this project and others that I’ve published can be found in this GitHub repository:
