Cleaning and preprocessing data is often one of the most daunting, yet critical, phases in building AI and machine learning solutions fueled by data, and text data is no exception.
This tutorial breaks the ice in tackling the challenge of preparing text data for NLP tasks such as those that Language Models (LMs) can solve. By encapsulating your text data in pandas DataFrames, the steps below will help you get your text ready to be digested by NLP models and algorithms.
Load the Data into a Pandas DataFrame
To keep this tutorial simple and focused on understanding the necessary text cleaning and preprocessing steps, let’s consider a small sample of four single-attribute text data instances that will be loaded into a pandas DataFrame. From now on, we will apply every preprocessing step to this DataFrame object.
import pandas as pd
data = {'text': ["I love cooking!", "Baking is fun", None, "Japanese cuisine is great!"]}
df = pd.DataFrame(data)
print(df)
Output:
text
0 I love cooking!
1 Baking is fun
2 None
3 Japanese cuisine is great!
Handle Missing Values
Did you notice the ‘None’ value in one of the example data instances? This is known as a missing value. Missing values arise for various reasons, often accidentally, during data collection. The bottom line: you need to handle them. The simplest approach is to detect and remove the instances containing missing values, as done in the code below:
df.dropna(subset=['text'], inplace=True)
print(df)
Output:
text
0 I love cooking!
1 Baking is fun
3 Japanese cuisine is great!
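Alternatively, if you would rather keep every row, you could fill missing values with a placeholder such as an empty string instead of dropping them. A minimal sketch of that approach:
# Fill missing text values with an empty string instead of removing the row
df['text'] = df['text'].fillna('')
Which strategy is preferable depends on how much data you have and on the downstream task; in this tutorial we simply drop the affected row.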
Normalize the Text to Make it Consistent
Normalizing text means standardizing or unifying elements that may appear in different formats across different instances, for instance date formats, full names, or letter casing. The simplest way to normalize our text is to convert all of it to lowercase, as follows.
df['text'] = df['text'].str.lower()
print(df)
Output:
text
0 i love cooking!
1 baking is fun
3 japanese cuisine is great!
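Lowercasing is usually just the starting point. Depending on your data, you might also normalize whitespace, for example by trimming leading and trailing spaces and collapsing repeated spaces; our small sample is already clean, so the sketch below leaves it unchanged:
# Trim leading/trailing whitespace and collapse runs of whitespace into a single space
df['text'] = df['text'].str.strip()
df['text'] = df['text'].str.replace(r'\s+', ' ', regex=True)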
Remove Noise
Noise is unnecessary or unexpectedly collected data that may hinder the subsequent modeling or prediction processes if not handled adequately. In our example, we will assume that punctuation marks like “!” are not needed for the subsequent NLP task, hence we remove this noise by detecting punctuation marks in the text with a regular expression. Python’s built-in ‘re’ module is used to perform text operations based on regular expression matching.
import re
df['text'] = df['text'].apply(lambda x: re.sub(r'[^\w\s]', '', x))
print(df)
Output:
text
0 i love cooking
1 baking is fun
3 japanese cuisine is great
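Real-world text often carries other kinds of noise as well, such as digits or URLs. The same pattern-based approach can be extended to those cases; our sample contains neither, so the sketch below would leave it unchanged:
# Remove digits and URLs, two common sources of noise in scraped or user-generated text
df['text'] = df['text'].apply(lambda x: re.sub(r'\d+', '', x))
df['text'] = df['text'].apply(lambda x: re.sub(r'http\S+', '', x))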
Tokenize the Text
Â
Tokenization is arguably the most important text preprocessing step (along with encoding text into a numerical representation) before using NLP and language models. It consists of splitting each text input into a sequence of chunks, or tokens. In the simplest scenario, tokens correspond to words most of the time, but in some cases, such as compound words, a single word may be split into multiple tokens. Certain punctuation marks (if they were not previously removed as noise) are also sometimes identified as standalone tokens.
This code splits each of our three text entries into individual words (tokens), adds them as a new column in our DataFrame, and then displays the updated data structure with its two columns. The simplified approach applied here is known as whitespace tokenization: it just uses whitespace as the criterion to detect and separate tokens.
df['tokens'] = df['text'].str.split()
print(df)
Output:
text tokens
0 i love cooking [i, love, cooking]
1 baking is fun [baking, is, fun]
3 japanese cuisine is great [japanese, cuisine, is, great]
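Whitespace splitting is enough for our tiny example, but real-world text usually benefits from a proper tokenizer that also handles punctuation and contractions. A minimal sketch using NLTK's word_tokenize, which requires downloading the 'punkt' resource (and, in recent NLTK versions, 'punkt_tab'), would look like this:
import nltk
from nltk.tokenize import word_tokenize
nltk.download('punkt')
# Tokenize each text entry with NLTK instead of plain whitespace splitting
df['tokens'] = df['text'].apply(word_tokenize)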
Remove Stop Words
Â
Once the text is tokenized, we filter out unnecessary tokens. This is typically the case for stop words, such as the articles “a/an, the” or conjunctions, which add little semantic value and can be removed for more efficient downstream processing. This process is language-dependent: the code below uses the NLTK library to download a list of English stop words and filter them out from the token vectors.
import nltk
from nltk.corpus import stopwords
nltk.download('stopwords')
stop_words = set(stopwords.words('english'))
df['tokens'] = df['tokens'].apply(lambda x: [word for word in x if word not in stop_words])
print(df['tokens'])
Output:
0 [love, cooking]
1 [baking, fun]
3 [japanese, cuisine, great]
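In practice, you may also want to treat some domain-specific words as stop words. Since stop_words is a plain Python set, you can extend it and re-apply the same filter; the words below are purely hypothetical examples and do not appear in our sample, so the output stays the same:
# Add hypothetical domain-specific stop words and filter again
stop_words.update(['recipe', 'ingredients'])
df['tokens'] = df['tokens'].apply(lambda x: [word for word in x if word not in stop_words])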
Stemming and Lemmatization
Almost there! Stemming and lemmatization are additional text preprocessing steps that might sometimes be used depending on the specific task at hand. Stemming reduces each token (word) to its base or root form by heuristically stripping suffixes, whilst lemmatization reduces it to its lemma, i.e. its base dictionary form, taking the context into account, e.g. “best” -> “good”. For simplicity, we will only apply stemming in this example, by using the PorterStemmer implemented in the NLTK library. The resulting stemmed words are saved in a new column in the DataFrame.
from nltk.stem import PorterStemmer
stemmer = PorterStemmer()
df['stemmed'] = df['tokens'].apply(lambda x: [stemmer.stem(word) for word in x])
print(df[['tokens','stemmed']])
Output:
tokens stemmed
0 [love, cooking] [love, cook]
1 [baking, fun] [bake, fun]
3 [japanese, cuisine, great] [japanes, cuisin, great]
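If your task benefits from proper dictionary forms rather than crude stems, you could apply lemmatization instead, using NLTK's WordNetLemmatizer; this is where the WordNet resource comes in (depending on your NLTK version you may also need the 'omw-1.4' resource). A minimal sketch:
import nltk
from nltk.stem import WordNetLemmatizer
nltk.download('wordnet')
lemmatizer = WordNetLemmatizer()
# lemmatize() treats words as nouns by default; passing pos='v' treats them as verbs, e.g. "baking" -> "bake"
df['lemmatized'] = df['tokens'].apply(lambda x: [lemmatizer.lemmatize(word, pos='v') for word in x])
print(df[['tokens', 'lemmatized']])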
Convert Text into Numerical Representations
Last but not least, computer algorithms, including AI/ML models, do not understand human language but numbers, hence we need to map our token vectors into numerical representations, commonly known as embedding vectors, or simply embeddings. The example below joins the tokens in the ‘tokens’ column back into strings and uses a TF-IDF vectorization approach (one of the most popular approaches in the good old days of classical NLP) to transform the text into numerical representations.
from sklearn.feature_extraction.text import TfidfVectorizer
df['text'] = df['tokens'].apply(lambda x: ' '.join(x))
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(df['text'])
print(X.toarray())
Output:
[[0. 0.70710678 0. 0. 0. 0. 0.70710678]
[0.70710678 0. 0. 0.70710678 0. 0. 0. ]
[0. 0. 0.57735027 0. 0.57735027 0.57735027 0. ]]
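Each column of this matrix corresponds to one word in the vocabulary learned by the vectorizer (seven words, matching the seven columns above). To see which word maps to which column, you can inspect the feature names; in recent scikit-learn versions this looks like:
# Show the vocabulary words in the order of the matrix columns
print(vectorizer.get_feature_names_out())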
And that’s it! As unintelligible as it may seem to us, this numerical representation of our preprocessed text is what intelligent systems, including NLP models, understand and can handle exceptionally well for challenging language tasks like classifying the sentiment in text, summarizing it, or even translating it into another language.
The next step would be feeding these numerical representations to our NLP model to let it do its magic.
Iván Palomares Carrascosa is a leader, writer, speaker, and adviser in AI, machine learning, deep learning & LLMs. He trains and guides others in harnessing AI in the real world.