Named Entity Recognition (NER) is one of the fundamental building blocks of natural language understanding. When humans read text, we naturally identify and categorize named entities based on context and world knowledge. For instance, in the sentence “Microsoft’s CEO Satya Nadella spoke at a conference in Seattle,” we effortlessly recognize the organizational, personal, and geographical references. However, teaching machines to replicate this seemingly intuitive human capability presents several challenges. Fortunately, this problem can be addressed effectively using a pretrained machine learning model.
In this post, you will learn how to solve the NER problem with a BERT model using just a few lines of Python code.
Let’s get started.
How to Do Named Entity Recognition (NER) with a BERT Model
Picture by Jon Tyson. Some rights reserved.
Overview
This post is in six parts; they are:
- The Complexity of NER Systems
- The Evolution of NER Technology
- BERT’s Revolutionary Approach to NER
- Using DistilBERT with Hugging Face’s Pipeline
- Using DistilBERT Explicitly with AutoModelForTokenClassification
- Best Practices for NER Implementation
The Complexity of NER Systems
The challenge of Named Entity Recognition extends far beyond simple pattern matching or dictionary lookups. Several key factors contribute to its complexity.
One of the most significant challenges is context dependency—understanding how words change meaning based on surrounding text. The same word can represent different entity types depending on its context. Consider these examples:
- “Apple announced new products.” (Apple is an organization.)
- “I ate an apple for lunch.” (Apple is a common noun, not a named entity.)
- “Apple Street is closed.” (Apple is a location.)
Named entities often consist of multiple words, making boundary detection another challenge. Entity names can be complex, such as:
- Corporate entities: “Bank of America Corporation”
- Product names: “iPhone 14 Pro Max”
- Person names: “Martin Luther King Jr.”
Additionally, language is dynamic and continuously evolving. Instead of memorizing what qualifies as an entity, models must deduce it from context. Language evolution introduces new entities, such as emerging companies, new products, and newly coined terms.
Now, let’s explore how state-of-the-art NER models address these challenges.
The Evolution of NER Technology
The evolution of NER technology reflects the broader advancement of natural language processing. Early approaches relied on rule-based systems and pattern matching—defining grammatical patterns, identifying capitalization, and using contextual markers (e.g., “the” before a proper noun). However, these rules were often numerous, inconsistent, and difficult to scale.
To improve accuracy, researchers introduced statistical approaches, leveraging probability-based models such as Hidden Markov Models (HMMs) and Conditional Random Fields (CRFs) to identify named entities.
With the rise of deep learning, neural networks became the preferred method for NER. Initially, bidirectional LSTM networks showed promise. However, the introduction of attention mechanisms and transformer-based models proved to be even more effective.
BERT’s Revolutionary Approach to NER
BERT (Bidirectional Encoder Representations from Transformers) has fundamentally transformed NER with several key innovations:
Contextual Understanding
Unlike traditional models that process text in one direction, BERT’s bidirectional nature allows it to consider both preceding and following text. This enables it to capture long-range dependencies, understand subtle contextual nuances, and handle ambiguous cases more effectively.
Tokenization and Subword Units
While not exclusive to BERT, its subword tokenization strategy allows it to handle unknown words while preserving morphological information. This reduces vocabulary size and makes the model adaptable across different languages and domains.
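As a quick illustration, the snippet below runs a BERT WordPiece tokenizer on a longer word; the exact split depends on the checkpoint's vocabulary, so treat the pieces shown in the comment as indicative:

```python
from transformers import AutoTokenizer

# Any BERT-style checkpoint works here; this is the one used later in the post
tokenizer = AutoTokenizer.from_pretrained("dbmdz/bert-large-cased-finetuned-conll03-english")

# Rare or unseen words are broken into known subword pieces, marked with "##"
print(tokenizer.tokenize("Snowboarding"))   # e.g. ['Snow', '##board', '##ing']
```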
The IOB Tagging Mechanism
NER results can be represented in various ways, but BERT uses the Inside-Outside-Beginning (IOB) tagging scheme:
- B marks the beginning of an entity.
- I indicates the continuation of an entity.
- O signifies non-entities.
This scheme enables BERT to delimit multi-word entities precisely and to mark where one entity ends and the next begins.
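To make the scheme concrete, here is how a short sentence would be tagged under IOB, using the CoNLL-03 label names (the exact tag set depends on the data the model was trained on):

```
Tim    → B-PER   (beginning of a person entity)
Cook   → I-PER   (continuation of the same person)
works  → O       (not an entity)
at     → O
Apple  → B-ORG   (beginning of an organization entity)
```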
Using DistilBERT with Hugging Face’s Pipeline
The easiest way to perform NER is by using Hugging Face’s pipeline
API, which abstracts away much of the complexity while still delivering powerful results. Here’s an example:
```python
from transformers import pipeline

# Initialize the NER pipeline
ner_pipeline = pipeline("ner",
                        model="dbmdz/bert-large-cased-finetuned-conll03-english",
                        aggregation_strategy="simple")

# Text example
text = "Apple CEO Tim Cook announced new iPhone models in California yesterday."

# Perform NER
entities = ner_pipeline(text)

# Print the results
for entity in entities:
    print(f"Entity: {entity['word']}")
    print(f"Type: {entity['entity_group']}")
    print(f"Confidence: {entity['score']:.4f}")
    print("-" * 30)
```
Now, let’s break down this code in detail. First, you initialize the pipeline:
```python
ner_pipeline = pipeline("ner",
                        model="dbmdz/bert-large-cased-finetuned-conll03-english",
                        aggregation_strategy="simple")
```
The pipeline()
function creates a ready-to-use NER pipeline. This is essential because BERT cannot work on raw strings: the text must be preprocessed into tensors before the model can process it, and the model's output must then be converted into a usable format. A pipeline handles both steps automatically.
The argument "ner" specifies that you want Named Entity Recognition, while model="dbmdz/bert-large-cased-finetuned-conll03-english" loads a pre-trained model fine-tuned specifically for NER. The final argument, aggregation_strategy="simple", merges subwords back into complete words, making the output more readable.
The pipeline above returns a list of dictionaries, where each dictionary contains:
- word: The detected entity text
- entity_group: The type of entity (e.g., PER for person, ORG for organization)
- score: Confidence score between 0 and 1
- start and end: Character positions of the entity in the original text
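For instance, the start and end offsets let you recover the exact span from the original string, which is handy for highlighting or downstream post-processing:

```python
for entity in entities:
    span = text[entity["start"]:entity["end"]]
    print(f"{span} -> {entity['entity_group']}")
```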
Running the pipeline example above produces the following output:
```
Entity: Apple
Type: ORG
Confidence: 0.9987
------------------------------
Entity: Tim Cook
Type: PER
Confidence: 0.9956
------------------------------
Entity: California
Type: LOC
Confidence: 0.9934
------------------------------
```
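To see what aggregation_strategy="simple" is doing for you, you can omit the argument, in which case the pipeline reports one record per token, with unmerged subword pieces and token-level IOB tags. A minimal sketch (in the un-aggregated output, each record carries an entity key rather than entity_group):

```python
from transformers import pipeline

# Same model, but without aggregation: raw token-level predictions
raw_pipeline = pipeline("ner", model="dbmdz/bert-large-cased-finetuned-conll03-english")

for item in raw_pipeline("Apple CEO Tim Cook announced new iPhone models."):
    # Each record is a single token with a tag such as B-ORG or I-PER
    print(item["word"], item["entity"], round(item["score"], 4))
```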
Using DistilBERT Explicitly with AutoModelForTokenClassification
For greater control over the NER process, you can bypass the pipeline and work directly with the model and tokenizer. This approach provides more flexibility and insight into the process. Here’s an example:
```python
from transformers import AutoTokenizer, AutoModelForTokenClassification
import torch

# Load model and tokenizer
model_name = "dbmdz/bert-large-cased-finetuned-conll03-english"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForTokenClassification.from_pretrained(model_name)

# Text example
text = "Google and Microsoft are competing in the AI space while Elon Musk founded SpaceX."

# Tokenize the text
inputs = tokenizer(text, return_tensors="pt", add_special_tokens=True)

# Get predictions
with torch.no_grad():
    outputs = model(**inputs)
    predictions = torch.argmax(outputs.logits, dim=2)

# Convert predictions to labels
label_list = model.config.id2label
tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
predictions = predictions[0].tolist()

# Process results
current_entity = []
current_entity_type = None

for token, prediction in zip(tokens, predictions):
    if token.startswith("##"):
        if current_entity:
            current_entity.append(token[2:])
    else:
        if current_entity:
            print(f"Entity: {''.join(current_entity)}")
            print(f"Type: {current_entity_type}")
            print("-" * 30)
            current_entity = []

        if label_list[prediction] != "O":
            current_entity = [token]
            current_entity_type = label_list[prediction]

# Print final entity if exists
if current_entity:
    print(f"Entity: {''.join(current_entity)}")
    print(f"Type: {current_entity_type}")
```
This implementation is more detailed. Let’s walk through it step by step. First, you load the model and tokenizer:
```python
model_name = "dbmdz/bert-large-cased-finetuned-conll03-english"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForTokenClassification.from_pretrained(model_name)
```
The AutoTokenizer
class automatically selects the appropriate tokenizer based on the model's configuration, ensuring compatibility. Tokenizers are responsible for transforming input text into tokens. AutoModelForTokenClassification
loads a model fine-tuned for token classification tasks, including both the model architecture and pre-trained weights.
Next, you preprocess the input text:
```python
inputs = tokenizer(text, return_tensors="pt", add_special_tokens=True)
```
This step converts the text into token IDs that the model can process. A token is typically a word but can also be a subword; for example, an uncommon word such as “subword” may be split into pieces like “sub” and “##word” even though it appears as a single word in the text. The return_tensors="pt"
argument returns the sequence as PyTorch tensors, while add_special_tokens=True
ensures the inclusion of [CLS]
and [SEP]
tokens at the beginning and end of the sequence, which BERT requires.
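If you want to inspect what the tokenizer produced, you can convert the IDs back into tokens; the exact subword pieces depend on the checkpoint's vocabulary, so the comment below is only indicative:

```python
print(tokenizer.convert_ids_to_tokens(inputs["input_ids"][0]))
# e.g. ['[CLS]', 'Google', 'and', 'Microsoft', 'are', ..., '[SEP]']
```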
Then, you run the model on the input tensor:
```python
with torch.no_grad():
    outputs = model(**inputs)
    predictions = torch.argmax(outputs.logits, dim=2)
```
Using torch.no_grad()
disables gradient calculation during inference, saving both time and memory. The function torch.argmax(outputs.logits, dim=2)
selects the most likely label for each token. The result, predictions, is a tensor of integer label indices, one per token.
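Checking the tensor shapes makes the argmax step easier to follow; the sequence length below is whatever the tokenizer produced for this sentence, and the label count depends on the checkpoint (CoNLL-03 models typically use nine tags), so treat the numbers in the comments as indicative:

```python
print(outputs.logits.shape)   # (batch size, sequence length, number of labels), e.g. torch.Size([1, 20, 9])
print(predictions.shape)      # (batch size, sequence length): one label index per token
```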
To convert the model’s output into human-readable text, we prepare a mapping between prediction indices and actual entity labels:
```python
label_list = model.config.id2label
tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
predictions = predictions[0].tolist()
```
The dictionary model.config.id2label maps prediction indices to actual entity labels. The function convert_ids_to_tokens converts integer token IDs back into readable text. Since the model was run on a single piece of input text, only one output sequence is expected, so you take the first row of predictions and convert it to a Python list for easier processing.
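Printing the mapping shows exactly which tags this checkpoint predicts; the ordering in the comment below is indicative, since it varies between checkpoints:

```python
print(label_list)
# e.g. {0: 'O', 1: 'B-PER', 2: 'I-PER', 3: 'B-ORG', 4: 'I-ORG',
#       5: 'B-LOC', 6: 'I-LOC', 7: 'B-MISC', 8: 'I-MISC'}
```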
Finally, you reconstruct the entity predictions using a loop. Since BERT’s tokenizer sometimes splits words into subwords (indicated by "##"
), you merge them back into complete words. The entity type is determined using the label_list
dictionary.
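To picture what the loop does, you can trace the same merging logic over a hypothetical token/label sequence (the actual WordPiece split produced by the tokenizer may differ):

```python
# Hypothetical tokens and labels for "Elon Musk founded SpaceX."
tokens = ["[CLS]", "El", "##on", "Mus", "##k", "founded", "Space", "##X", ".", "[SEP]"]
labels = ["O", "B-PER", "I-PER", "I-PER", "I-PER", "O", "B-ORG", "I-ORG", "O", "O"]

merged, current = [], []
for token, label in zip(tokens, labels):
    if token.startswith("##") and current:
        current.append(token[2:])            # glue the subword back onto its word
    else:
        if current:
            merged.append("".join(current))  # a word boundary: flush the pending entity
        current = [token] if label != "O" else []
if current:
    merged.append("".join(current))

print(merged)   # ['Elon', 'Musk', 'SpaceX'] -- note that separate words are not joined
```

As the output suggests, this simple loop reports "Elon" and "Musk" as separate entities; merging whole multi-word entities is exactly what aggregation_strategy="simple" did for you in the pipeline example.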
Best Practices for NER Implementation
Performing Named Entity Recognition (NER) is as simple as shown above. However, you are not required to use the exact code provided. Specifically, you can switch between different models (along with the corresponding tokenizer). If you need faster processing, consider using a DistilBERT model. If accuracy is a priority, opt for a larger BERT or RoBERTa model. Additionally, if your input requires domain-specific knowledge, you may benefit from using a domain-adapted model.
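For example, swapping in a distilled checkpoint is just a matter of changing the model name. The checkpoint below is a community DistilBERT model fine-tuned on CoNLL-03 published on the Hugging Face Hub; verify that it (or whichever model you choose) exists and fits your data before relying on it:

```python
from transformers import pipeline

# A smaller, faster alternative to the bert-large checkpoint used above
fast_ner = pipeline("ner",
                    model="elastic/distilbert-base-cased-finetuned-conll03-english",
                    aggregation_strategy="simple")

print(fast_ner("Tim Cook visited Berlin last week."))
```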
If you need to process a large volume of text for NER, you can improve efficiency by processing inputs in batches. Other techniques, such as using a GPU for acceleration or caching results for frequently accessed texts, can further enhance performance.
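With the pipeline API, batching amounts to passing a list of texts together with a batch_size; here is a minimal sketch, where the batch size and device index are illustrative and should be tuned to your hardware:

```python
from transformers import pipeline

# device=0 selects the first GPU; use device=-1 (the default) to stay on CPU
ner_pipeline = pipeline("ner",
                        model="dbmdz/bert-large-cased-finetuned-conll03-english",
                        aggregation_strategy="simple",
                        device=0)

texts = [
    "Amazon opened a new office in Toronto.",
    "Angela Merkel met Emmanuel Macron in Paris.",
    "The Raspberry Pi Foundation is based in Cambridge.",
]

# Passing a list lets the pipeline batch the inputs internally
for doc_entities in ner_pipeline(texts, batch_size=8):
    print(doc_entities)
```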
In a production system, proper error-handling logic should also be implemented. This includes validating input, handling edge cases such as empty strings and special characters, and addressing other potential issues.
Here’s a complete example incorporating these best practices:
```python
from transformers import pipeline
import torch
import logging
from typing import List, Dict

class NERProcessor:
    def __init__(self,
                 model_name: str = "dbmdz/bert-large-cased-finetuned-conll03-english",
                 confidence_threshold: float = 0.8):
        self.confidence_threshold = confidence_threshold
        try:
            self.device = "cuda" if torch.cuda.is_available() else "cpu"
            self.ner_pipeline = pipeline("ner",
                                         model=model_name,
                                         aggregation_strategy="simple",
                                         device=self.device)
        except Exception as e:
            logging.error(f"Failed to initialize NER pipeline: {str(e)}")
            raise

    def process_text(self, text: str) -> List[Dict]:
        if not text or not isinstance(text, str):
            logging.warning("Invalid input text")
            return []

        try:
            # Get predictions
            entities = self.ner_pipeline(text)

            # Post-process results
            filtered_entities = [
                entity for entity in entities
                if entity["score"] >= self.confidence_threshold
            ]

            return filtered_entities
        except Exception as e:
            logging.error(f"Error processing text: {str(e)}")
            return []

if __name__ == "__main__":
    # Initialize processor
    processor = NERProcessor()

    # Text example
    text = """
    Apple Inc. CEO Tim Cook announced new partnerships with Microsoft
    and Google during a conference in New York City. The event was also
    attended by Sundar Pichai and Satya Nadella.
    """

    # Process text
    results = processor.process_text(text)

    # Print results
    for entity in results:
        print(f"Entity: {entity['word']}")
        print(f"Type: {entity['entity_group']}")
        print(f"Confidence: {entity['score']:.4f}")
        print("-" * 30)
```
Summary
Named Entity Recognition with BERT models provides a powerful way to extract structured information from text. The Hugging Face Transformers library makes it easy to implement NER with state-of-the-art models, whether you need a simple pipeline approach or more detailed control over the process.
In this tutorial, you learned about NER with BERT. In particular, you learned how to:
- Use the pipeline API for quick prototypes and simple applications
- Use explicit model handling for more control and custom processing
- Consider performance optimization for production applications
- Handle edge cases and implement proper error handling
With these tools and techniques, you can build robust NER systems for various applications, from information extraction to document processing and more.