How to Do Named Entity Recognition (NER) with a BERT Model


Named Entity Recognition (NER) is one of the fundamental building blocks of natural language understanding. When humans read text, we naturally identify and categorize named entities based on context and world knowledge. For instance, in the sentence “Microsoft’s CEO Satya Nadella spoke at a conference in Seattle,” we effortlessly recognize the organizational, personal, and geographical references. However, teaching machines to replicate this seemingly intuitive human capability presents several challenges. Fortunately, this problem can be addressed effectively using a pretrained machine learning model.

In this post, you will learn how to solve the NER problem with a BERT model using just a few lines of Python code.

Let’s get started.


Overview

This post is in six parts; they are:

  • The Complexity of NER Systems
  • The Evolution of NER Technology
  • BERT’s Revolutionary Approach to NER
  • Using BERT with Hugging Face’s Pipeline
  • Using BERT Explicitly with AutoModelForTokenClassification
  • Best Practices for NER Implementation

The Complexity of NER Systems

The challenge of Named Entity Recognition extends far beyond simple pattern matching or dictionary lookups. Several key factors contribute to its complexity.

One of the most significant challenges is context dependency—understanding how words change meaning based on surrounding text. The same word can represent different entity types depending on its context. Consider these examples:

  • “Apple announced new products.” (Apple is an organization.)
  • “I ate an apple for lunch.” (Apple is a common noun, not a named entity.)
  • “Apple Street is closed.” (Apple is part of a location name.)

Named entities often consist of multiple words, making boundary detection another challenge. Entity names can be complex, such as:

  • Corporate entities: “Bank of America Corporation”
  • Product names: “iPhone 14 Pro Max”
  • Person names: “Martin Luther King Jr.”

Additionally, language is dynamic and continuously evolving. Instead of memorizing what qualifies as an entity, models must deduce it from context. Language evolution introduces new entities, such as emerging companies, new products, and newly coined terms.

Now, let’s explore how state-of-the-art NER models address these challenges.

The Evolution of NER Technology

The evolution of NER technology reflects the broader advancement of natural language processing. Early approaches relied on rule-based systems and pattern matching—defining grammatical patterns, identifying capitalization, and using contextual markers (e.g., “the” before a proper noun). However, these rules were often numerous, inconsistent, and difficult to scale.

To improve accuracy, researchers introduced statistical approaches, leveraging probability-based models such as Hidden Markov Models (HMMs) and Conditional Random Fields (CRFs) to identify named entities.

With the rise of deep learning, neural networks became the preferred method for NER. Initially, bidirectional LSTM networks showed promise. However, the introduction of attention mechanisms and transformer-based models proved to be even more effective.

BERT’s Revolutionary Approach to NER

BERT (Bidirectional Encoder Representations from Transformers) has fundamentally transformed NER with several key innovations:

Contextual Understanding

Unlike traditional models that process text in one direction, BERT’s bidirectional nature allows it to consider both preceding and following text. This enables it to capture long-range dependencies, understand subtle contextual nuances, and handle ambiguous cases more effectively.

Tokenization and Subword Units

While not exclusive to BERT, its subword tokenization strategy allows it to handle unknown words while preserving morphological information. This reduces vocabulary size and makes the model adaptable across different languages and domains.

The IOB Tagging Mechanism

NER results can be represented in various ways, but BERT-based NER models are typically fine-tuned with the Inside-Outside-Beginning (IOB) tagging scheme:

  • B marks the beginning of an entity.
  • I indicates the continuation of an entity.
  • O signifies non-entities.

This scheme lets the model mark entity boundaries precisely, which is essential for handling multi-word entities.
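For example, under IOB tagging the sentence “Martin Luther King Jr. visited Seattle” would be labeled word by word as:

Martin → B-PER
Luther → I-PER
King → I-PER
Jr. → I-PER
visited → O
Seattle → B-LOC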

Using BERT with Hugging Face’s Pipeline

The easiest way to perform NER is by using Hugging Face’s pipeline API, which abstracts away much of the complexity while still delivering powerful results. Here’s an example:
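A minimal sketch is shown below; it uses the model and arguments described in the rest of this section, and the example sentence is borrowed from the introduction.

from transformers import pipeline

# Create a ready-to-use NER pipeline with a model fine-tuned on CoNLL-2003
ner_pipeline = pipeline(
    "ner",
    model="dbmdz/bert-large-cased-finetuned-conll03-english",
    aggregation_strategy="simple",
)

# Example sentence from the introduction
text = "Microsoft's CEO Satya Nadella spoke at a conference in Seattle."
results = ner_pipeline(text)

# Each result is a dictionary describing one detected entity
for entity in results:
    print(f"{entity['word']}: {entity['entity_group']} ({entity['score']:.2f})")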

Now, let’s break down this code in detail. First, you initialize the pipeline:
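from transformers import pipeline

ner_pipeline = pipeline(
    "ner",
    model="dbmdz/bert-large-cased-finetuned-conll03-english",
    aggregation_strategy="simple",
)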

The pipeline() function creates a ready-to-use NER pipeline. This is essential because, while BERT is a machine learning model, raw text must be preprocessed into tensors before the model can consume it, and the model’s raw output must then be converted into a usable format. A pipeline handles both of these steps automatically.

The argument "ner" specifies that you want Named Entity Recognition, and model="dbmdz/bert-large-cased-finetuned-conll03-english" loads a pre-trained BERT model fine-tuned specifically for NER. The final argument, aggregation_strategy="simple", merges subwords into complete words, making the output more readable.

The pipeline above returns a list of dictionaries, where each dictionary contains:

  • word: The detected entity text
  • entity_group: The type of entity (e.g., PER for person, ORG for organization)
  • score: Confidence score between 0 and 1
  • start and end: Character positions in the original text

For the example sentence above, the printed output should look roughly like this (the exact confidence scores will vary with the model version):
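Microsoft: ORG (1.00)
Satya Nadella: PER (1.00)
Seattle: LOC (1.00)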

Using BERT Explicitly with AutoModelForTokenClassification

For greater control over the NER process, you can bypass the pipeline and work directly with the model and tokenizer. This approach provides more flexibility and insight into the process. Here’s an example:
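The sketch below strings together the steps explained in the rest of this section; it assumes the same checkpoint as the pipeline example.

import torch
from transformers import AutoTokenizer, AutoModelForTokenClassification

# Load the tokenizer and model (same checkpoint as the pipeline example)
model_name = "dbmdz/bert-large-cased-finetuned-conll03-english"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForTokenClassification.from_pretrained(model_name)

# Preprocess the input text into token IDs
text = "Microsoft's CEO Satya Nadella spoke at a conference in Seattle."
inputs = tokenizer(text, return_tensors="pt", add_special_tokens=True)

# Run the model without tracking gradients and pick the best label per token
with torch.no_grad():
    outputs = model(**inputs)
predictions = torch.argmax(outputs.logits, dim=2)

# Map label indices to names and token IDs back to token strings
label_list = model.config.id2label
tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
predictions = predictions[0].tolist()

# Reconstruct entities, merging subword pieces (marked with "##")
current_entity = []
current_label = None
for token, pred in zip(tokens, predictions):
    if token in ("[CLS]", "[SEP]"):
        continue
    label = label_list[pred]
    if label == "O":
        if current_entity:
            print(f"{' '.join(current_entity)}: {current_label}")
            current_entity, current_label = [], None
        continue
    entity_type = label[2:]  # strip the "B-"/"I-" prefix
    if token.startswith("##") and current_entity:
        current_entity[-1] += token[2:]          # merge subword into previous token
    elif label.startswith("B-") or not current_entity or entity_type != current_label:
        if current_entity:
            print(f"{' '.join(current_entity)}: {current_label}")
        current_entity, current_label = [token], entity_type  # start a new entity
    else:
        current_entity.append(token)             # continue the current entity
if current_entity:
    print(f"{' '.join(current_entity)}: {current_label}")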

This implementation is more detailed. Let’s walk through it step by step. First, you load the model and tokenizer:
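from transformers import AutoTokenizer, AutoModelForTokenClassification

model_name = "dbmdz/bert-large-cased-finetuned-conll03-english"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForTokenClassification.from_pretrained(model_name)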

The AutoTokenizer class automatically selects the appropriate tokenizer for the checkpoint based on its configuration, ensuring compatibility. The tokenizer is responsible for transforming input text into tokens. AutoModelForTokenClassification loads a model fine-tuned for token classification tasks, including both the model architecture and pre-trained weights.

Next, you preprocess the input text:
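text = "Microsoft's CEO Satya Nadella spoke at a conference in Seattle."
inputs = tokenizer(text, return_tensors="pt", add_special_tokens=True)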

This step converts the text into token IDs that the model can process. A token is typically a word, but it can also be a subword; for example, an uncommon word such as “subword” may be split into the pieces “sub” and “##word”. The return_tensors="pt" argument returns the sequence as PyTorch tensors, while add_special_tokens=True adds the [CLS] and [SEP] tokens to the beginning and end of the sequence, as BERT requires.

Then, you run the model on the input tensor:
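with torch.no_grad():
    outputs = model(**inputs)
predictions = torch.argmax(outputs.logits, dim=2)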

Using torch.no_grad() disables gradient calculation during inference, saving both time and memory. The call torch.argmax(outputs.logits, dim=2) selects the most likely label index for each token, so predictions is a tensor of integers.

To convert the model’s output into human-readable text, we prepare a mapping between prediction indices and actual entity labels:
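label_list = model.config.id2label
tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
predictions = predictions[0].tolist()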

The dictionary model.config.id2label maps prediction indices to entity labels, and convert_ids_to_tokens converts the integer token IDs back into readable tokens. Since you ran the model on a single line of input text, only one output sequence is expected, so you take the first row of predictions and convert it to a Python list for easier processing.

Finally, you reconstruct the entity predictions using a loop. Since BERT’s tokenizer sometimes splits words into subwords (indicated by "##"), you merge them back into complete words. The entity type is determined using the label_list dictionary.
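current_entity = []
current_label = None
for token, pred in zip(tokens, predictions):
    if token in ("[CLS]", "[SEP]"):
        continue
    label = label_list[pred]
    if label == "O":
        if current_entity:
            print(f"{' '.join(current_entity)}: {current_label}")
            current_entity, current_label = [], None
        continue
    entity_type = label[2:]  # strip the "B-"/"I-" prefix
    if token.startswith("##") and current_entity:
        current_entity[-1] += token[2:]          # merge subword into previous token
    elif label.startswith("B-") or not current_entity or entity_type != current_label:
        if current_entity:
            print(f"{' '.join(current_entity)}: {current_label}")
        current_entity, current_label = [token], entity_type  # start a new entity
    else:
        current_entity.append(token)             # continue the current entity
if current_entity:
    print(f"{' '.join(current_entity)}: {current_label}")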

Best Practices for NER Implementation

Performing Named Entity Recognition (NER) is as simple as shown above. However, you are not required to use the exact code provided. Specifically, you can switch between different models (along with the corresponding tokenizer). If you need faster processing, consider using a DistilBERT model. If accuracy is a priority, opt for a larger BERT or RoBERTa model. Additionally, if your input requires domain-specific knowledge, you may benefit from using a domain-adapted model.

If you need to process a large volume of text for NER, you can improve efficiency by processing inputs in batches. Other techniques, such as using a GPU for acceleration or caching results for frequently accessed texts, can further enhance performance.
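As a rough sketch (assuming a CUDA-capable GPU is available at index 0), the pipeline API accepts a list of texts, a batch_size argument, and a device argument:

from transformers import pipeline

# Run on GPU 0 and process documents in batches of 16
ner_pipeline = pipeline(
    "ner",
    model="dbmdz/bert-large-cased-finetuned-conll03-english",
    aggregation_strategy="simple",
    device=0,  # omit this (or pass -1) to stay on the CPU
)

documents = ["First document ...", "Second document ...", "Third document ..."]
results = ner_pipeline(documents, batch_size=16)  # one list of entities per document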

In a production system, proper error-handling logic should also be implemented. This includes validating input, handling edge cases such as empty strings and special characters, and addressing other potential issues.

Here’s a complete example incorporating these best practices:
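The sketch below is one illustrative way to put these ideas together; the NERProcessor class and its method names are not from any library.

import torch
from transformers import pipeline


class NERProcessor:
    """Wrap the NER pipeline with input validation, GPU detection, and caching."""

    def __init__(self, model_name="dbmdz/bert-large-cased-finetuned-conll03-english"):
        # Use the first GPU if one is available, otherwise fall back to the CPU
        device = 0 if torch.cuda.is_available() else -1
        self.ner_pipeline = pipeline(
            "ner",
            model=model_name,
            aggregation_strategy="simple",
            device=device,
        )
        self._cache = {}  # cache results for frequently seen texts

    def extract_entities(self, text):
        # Validate the input and handle edge cases before calling the model
        if not isinstance(text, str):
            raise TypeError("Input must be a string")
        text = text.strip()
        if not text:
            return []
        if text in self._cache:
            return self._cache[text]
        try:
            entities = self.ner_pipeline(text)
        except Exception as exc:
            # In production, log the error and decide whether to re-raise it
            print(f"NER failed: {exc}")
            return []
        self._cache[text] = entities
        return entities

    def extract_batch(self, texts, batch_size=16):
        # Process many documents at once for better throughput
        texts = [t for t in texts if isinstance(t, str) and t.strip()]
        return self.ner_pipeline(texts, batch_size=batch_size)


if __name__ == "__main__":
    processor = NERProcessor()
    text = "Microsoft's CEO Satya Nadella spoke at a conference in Seattle."
    for entity in processor.extract_entities(text):
        print(f"{entity['word']}: {entity['entity_group']} ({entity['score']:.2f})")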

Summary

Named Entity Recognition with BERT models provides a powerful way to extract structured information from text. The Hugging Face Transformers library makes it easy to implement NER with state-of-the-art models, whether you need a simple pipeline approach or more detailed control over the process.

In this tutorial, you learned about NER with BERT. In particular, you learned how to:

  • Use the pipeline API for quick prototypes and simple applications
  • Use explicit model handling for more control and custom processing
  • Consider performance optimization for production applications
  • Always handle edge cases and implement proper error handling

With these tools and techniques, you can build robust NER systems for various applications, from information extraction to document processing and more.

 
