Learning natural language processing can be a super useful addition to your developer toolkit. From the basics to building LLM-powered applications, you can get up to speed with natural language processing in a few weeks, one small step at a time. And this article will help you get started.
In this article, we'll learn the basics of natural language processing with Python, taking a code-first approach using the Natural Language Toolkit (NLTK). Let's begin!
▶️ Link to the Google Colab notebook for this tutorial
Installing NLTK
Before diving into NLP tasks, we need to install the Natural Language Toolkit (NLTK). NLTK provides a suite of text processing tools, including tokenizers, lemmatizers, POS taggers, and preloaded datasets. Think of it as a Swiss Army knife for NLP. Setting it up involves installing the library and downloading the necessary datasets and models.
Install the NLTK Library
Run the following command in your terminal or command prompt to install NLTK:
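pip install nltk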
This installs the core NLTK library, which contains the main modules needed for text processing tasks.
Download NLTK Resources
After installation, download NLTK’s pre-packaged datasets and tools. These include stopword lists, tokenizers, and lexicons like WordNet:
import nltk
# Download essential datasets and models
nltk.download('punkt')        # Tokenizers for sentence and word tokenization
nltk.download('stopwords')    # List of common stop words
nltk.download('wordnet')      # WordNet lexical database for lemmatization
nltk.download('averaged_perceptron_tagger_eng')  # Part-of-speech tagger
nltk.download('maxent_ne_chunker_tab')           # Named entity recognition model
nltk.download('words')        # Word corpus for NER
nltk.download('punkt_tab')    # Tokenizer tables required by newer NLTK versions
Text Preprocessing
Text preprocessing is an essential step in NLP, transforming raw text into a clean and structured format that is easier to analyze. The goal is to zero in on the meaningful components of the text while breaking it into chunks that can be processed.
In this section, we cover three important preprocessing steps: tokenization, stop word removal, and stemming.
Tokenization
Tokenization is one of the common preprocessing tasks. It involves splitting text into smaller units—tokens. These tokens can be words, sentences, or even sub-word units, depending on the task.
- Sentence tokenization splits the text into sentences
- Word tokenization splits the text into words and punctuation marks
In the following code, we use NLTK's sent_tokenize to split the input text into sentences and word_tokenize to break it down into words. We also perform a super simple preprocessing step of removing all punctuation from the text first:
import string
from nltk.tokenize import word_tokenize, sent_tokenize

text = "Natural Language Processing (NLP) is cool! Let's explore it."

# Remove punctuation using string.punctuation
cleaned_text = ''.join(char for char in text if char not in string.punctuation)
print("Text without punctuation:", cleaned_text)

# Sentence tokenization
sentences = sent_tokenize(cleaned_text)
print("Sentences:", sentences)

# Word tokenization
words = word_tokenize(cleaned_text)
print("Words:", words)
This allows us to analyze the structure of the text at both the sentence and word levels.
In this example, sent_tokenize(cleaned_text) splits the input string into sentences, returning a list of sentence strings. Because we stripped out the punctuation first, there is no sentence boundary left for the tokenizer to find, so the result is a list with a single element rather than one entry per original sentence.
Next, the word_tokenize(cleaned_text) function is applied to the same text. It normally breaks text into individual words and punctuation, treating things like parentheses and exclamation marks as separate tokens. Since we've already removed all punctuation, the output is as follows:
Text without punctuation: Natural Language Processing NLP is cool Lets explore it
Sentences: ['Natural Language Processing NLP is cool Lets explore it']
Words: ['Natural', 'Language', 'Processing', 'NLP', 'is', 'cool', 'Lets', 'explore', 'it']
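For contrast, here's a quick check reusing the text variable defined above: running sent_tokenize on the original, still-punctuated string returns one entry per sentence.

# Sentence tokenization on the original, punctuated text
print(sent_tokenize(text))
# Expected: ['Natural Language Processing (NLP) is cool!', "Let's explore it."]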
Stopwords Removal
Stopwords are common words such as “the,” “and,” or “is” that occur frequently but carry little meaning in most analyses. Removing these words helps focus on the more meaningful words in the text.
In essence, you filter out stop words to reduce noise in the dataset. We can use NLTK’s stopwords corpus to identify and remove stop words from the list of tokens obtained after tokenization:
from nltk.corpus import stopwords
# Load NLTK's stopwords list
stop_words = set(stopwords.words('english'))

# Filter out stop words
filtered_words = [word for word in words if word.lower() not in stop_words]
print("Filtered Words:", filtered_words)
Here, we load the set of English stop words using stopwords.words('english') from NLTK. Then, we use a list comprehension to iterate over the list of tokens generated by word_tokenize. By checking whether each token (converted to lowercase) is in the set of stop words, we remove common words that don't contribute to the meaning of the text.
Here’s the filtered result:
Filtered Words: ['Natural', 'Language', 'Processing', 'NLP', 'cool', 'Lets', 'explore']
Stemming
Stemming is the process of reducing words to their root form by removing affixes like suffixes and prefixes. The root form may not always be a valid word in the dictionary, but it helps in standardizing variations of the same word.
The Porter stemmer is a common stemming algorithm that works by removing suffixes. Let's use NLTK's PorterStemmer to stem the filtered word list:
from nltk.stem import PorterStemmer
# Initialize the Porter stemmer
stemmer = PorterStemmer()

# Apply stemming to the filtered words
stemmed_words = [stemmer.stem(word) for word in filtered_words]
print("Stemmed Words:", stemmed_words)
Here, we initialize the PorterStemmer and use it to process each word in the list filtered_words.
The stemmer.stem() function strips common suffixes like “-ing,” “-ed,” and “-ly” from words to reduce them to their root form. While stemming helps reduce the number of variations of words, it’s important to note that the results may not always be valid dictionary words.
Stemmed Words: ['natur', 'languag', 'process', 'nlp', 'cool', 'let', 'explor']
Before we proceed, here's a summary of the text preprocessing steps (a short sketch tying them together follows the list):
- Tokenization breaks text into smaller units.
- Stop word removal filters out common, non-meaningful words to focus on more significant terms in the analysis.
- Stemming reduces words to their root forms, simplifying variations and helping standardize text for analysis.
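Here's that end-to-end sketch: a small helper that chains the same NLTK pieces used above (the preprocess function name is our own, not an NLTK API).

import string

from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize

stop_words = set(stopwords.words('english'))
stemmer = PorterStemmer()

def preprocess(text):
    # Strip punctuation, tokenize, drop stop words, then stem what's left
    cleaned = ''.join(char for char in text if char not in string.punctuation)
    tokens = word_tokenize(cleaned)
    filtered = [word for word in tokens if word.lower() not in stop_words]
    return [stemmer.stem(word) for word in filtered]

print(preprocess("Natural Language Processing (NLP) is cool! Let's explore it."))
# Expected: ['natur', 'languag', 'process', 'nlp', 'cool', 'let', 'explor']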
With these preprocessing steps completed, you can move on to learn about lemmatization, part-of-speech tagging, and named entity recognition.
Lemmatization
Lemmatization is similar to stemming in that it also reduces words to their base form. But unlike stemming, lemmatization returns valid dictionary words. Lemmatization factors in context, such as a word's part of speech (POS), to reduce it to its lemma. For example, the words "running" and "ran" would both be reduced to "run."
Lemmatization generally produces more accurate results than stemming, as it keeps the word in a recognizable form. The most common tool for lemmatization in NLTK is the WordNetLemmatizer, which uses the WordNet lexical database.
- Lemmatization reduces a word to its lemma by considering its meaning and context—not just by chopping off affixes.
- WordNetLemmatizer is the NLTK tool commonly used for lemmatization.
In the code snippet below, we use NLTK’s WordNetLemmatizer to lemmatize words from the previously filtered list:
from nltk.stem import WordNetLemmatizer
# Initialize the lemmatizer
lemmatizer = WordNetLemmatizer()

# Lemmatize each word, treating it as a verb
lemmatized_words = [lemmatizer.lemmatize(word, pos='v') for word in filtered_words]
print("Lemmatized Words:", lemmatized_words)
Here, we initialize the WordNetLemmatizer and use its lemmatize() method to process each word in the filtered_words list. We specify pos='v' so that each word is treated as a verb and reduced to its verb lemma; passing the part of speech helps the lemmatizer apply the correct lemmatization rule.
Lemmatized Words: ['Natural', 'Language', 'Processing', 'NLP', 'cool', 'Lets', 'explore']
So why is lemmatization helpful? Lemmatization is particularly useful when you want to reduce words to their base form but still retain their meaning. It's a more accurate and context-sensitive method than stemming, which makes it ideal for tasks that require high accuracy, such as text classification or sentiment analysis.
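To see the difference concretely, here's a quick comparison sketch using the same PorterStemmer and WordNetLemmatizer from above (the example words are arbitrary):

from nltk.stem import PorterStemmer, WordNetLemmatizer

stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

for word in ["studies", "running", "leaves"]:
    print(f"{word}: stem = {stemmer.stem(word)}, "
          f"lemma (as verb) = {lemmatizer.lemmatize(word, pos='v')}")

# The stemmer chops suffixes ('studies' -> 'studi'), while the lemmatizer
# returns dictionary words ('studies' -> 'study' when treated as a verb).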
Part-of-Speech (POS) Tagging
Part-of-speech (POS) tagging involves identifying the grammatical category of each word in a sentence, such as nouns, verbs, adjectives, adverbs, and more. POS tagging also helps understand the syntactic structure of a sentence, enabling better handling of tasks such as text parsing, information extraction, and machine translation.
The POS tags assigned to words can be based on a standard set such as the Penn Treebank POS tags. For example, in the sentence “The dog runs fast,” “dog” would be tagged as a noun (NN), “runs” as a verb (VBZ), and “fast” as an adverb (RB).
- POS tagging assigns labels to words in a sentence.
- Tagging helps analyze the syntax of the sentence and understand word functions in context.
With NLTK, you can perform POS tagging using the pos_tag function, which tags each word in a list of tokens with its part of speech. In the following example, we first tokenize the text and then use NLTK’s pos_tag function to assign POS tags to each word.
from nltk import pos_tag, word_tokenize

# Sample text
text = "She enjoys playing soccer on weekends."

# Word tokenization
words = word_tokenize(text)

# POS tagging
tagged_words = pos_tag(words)
print("Tagged Words:", tagged_words)
This should output:
Tagged Words: [('She', 'PRP'), ('enjoys', 'VBZ'), ('playing', 'VBG'), ('soccer', 'NN'), ('on', 'IN'), ('weekends', 'NNS'), ('.', '.')]
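If a tag like VBZ or NNS in the output isn't familiar, NLTK's help module can print the Penn Treebank definitions. Here's a small sketch, assuming the tag set documentation has been downloaded (newer NLTK releases call the resource 'tagsets_json', older ones 'tagsets'):

import nltk
from nltk.help import upenn_tagset

# One-time download of the tag set documentation;
# the resource name depends on your NLTK version
nltk.download('tagsets_json')
nltk.download('tagsets')

# Print the definition and examples for a Penn Treebank tag
upenn_tagset('VBZ')
upenn_tagset('NNS')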
POS tagging is necessary for understanding sentence structure and for tasks that involve syntactic analysis, such as named entity recognition (NER) and machine translation.
Named Entity Recognition (NER)
Named Entity Recognition (NER) is an NLP task used to identify and classify named entities in a text, such as the names of people, organizations, locations, and dates. This technique is essential for understanding and extracting useful information from text.
Here is an example:
from nltk import ne_chunk, pos_tag, word_tokenize
# Sample text
text = "We shall visit the Eiffel Tower on our vacation to Paris."

# Tokenize the text into words
words = word_tokenize(text)

# Part-of-speech tagging
tagged_words = pos_tag(words)

# Named entity recognition
named_entities = ne_chunk(tagged_words)
print("Named Entities:", named_entities)
In this case, NER helps extract geographical references, such as the landmark and the city.
Named Entities: (S
  We/PRP
  shall/MD
  visit/VB
  the/DT
  (ORGANIZATION Eiffel/NNP Tower/NNP)
  on/IN
  our/PRP$
  vacation/NN
  to/TO
  (GPE Paris/NNP)
  ./.)
This can then be used in various tasks, such as summarizing articles, extracting information for knowledge graphs, and more.
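Since ne_chunk returns an nltk.Tree, pulling out just the labeled entities takes a short traversal. Here's a minimal sketch that collects (entity text, label) pairs from the named_entities tree built above:

# Collect (entity text, entity label) pairs from the ne_chunk output
entities = []
for subtree in named_entities:
    # Labeled chunks (e.g. GPE, ORGANIZATION) are nested Trees;
    # plain (word, tag) tuples have no label() method
    if hasattr(subtree, 'label'):
        entity_text = ' '.join(token for token, tag in subtree.leaves())
        entities.append((entity_text, subtree.label()))

print(entities)
# Expected: [('Eiffel Tower', 'ORGANIZATION'), ('Paris', 'GPE')]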
Wrap-Up & Next Steps
In this guide, we’ve covered essential concepts in natural language processing using NLTK—from basic text preprocessing to slightly more involved techniques like lemmatization, POS tagging, and named entity recognition.
So where do you go from here? As you continue your NLP journey, here are a few next steps to consider:
- Work on simple text classification problems using algorithms like logistic regression, support vector machines, and Naive Bayes.
- Try sentiment analysis with tools like VADER or by training your own classifier (see the sketch after this list).
- Dive deeper into topic modeling or text summarization tasks.
- Explore other NLP libraries such as spaCy or Hugging Face’s Transformers for state-of-the-art models.
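For instance, the sentiment analysis idea above can be tried in a few lines with NLTK's built-in VADER analyzer. A minimal sketch (the example sentence is arbitrary, and the vader_lexicon resource needs a one-time download):

import nltk
from nltk.sentiment import SentimentIntensityAnalyzer

# VADER needs its lexicon downloaded once
nltk.download('vader_lexicon')

sia = SentimentIntensityAnalyzer()
print(sia.polarity_scores("NLTK makes getting started with NLP surprisingly easy!"))
# Returns negative/neutral/positive proportions plus a compound score between -1 and 1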
What would you like to learn next? Let us know! Keep coding!