Named Entity Recognition (NER) is one of the fundamental building blocks of natural language understanding. When humans read text, we naturally identify and categorize named entities based on context and world knowledge. For instance, in the sentence “Microsoft’s CEO Satya Nadella spoke at a conference in Seattle,” we effortlessly recognize the organizational, personal, and geographical references. However, teaching machines to replicate this seemingly intuitive human capability presents several challenges. Fortunately, this problem can be addressed effectively using a pretrained machine learning model.
In this post, you will learn how to solve the NER problem with a BERT model using just a few lines of Python code.
Let’s get started.
How to Do Named Entity Recognition (NER) with a BERT Model
Picture by Jon Tyson. Some rights reserved.
Overview
This post is in six parts; they are:
- The Complexity of NER Systems
- The Evolution of NER Technology
- BERT’s Revolutionary Approach to NER
- Using DistilBERT with Hugging Face’s Pipeline
- Using DistilBERT Explicitly with AutoModelForTokenClassification
- Best Practices for NER Implementation
The Complexity of NER Systems
The challenge of Named Entity Recognition extends far beyond simple pattern matching or dictionary lookups. Several key factors contribute to its complexity.
One of the most significant challenges is context dependency—understanding how words change meaning based on surrounding text. The same word can represent different entity types depending on its context. Consider these examples:
- “Apple announced new products.” (Apple is an organization.)
- “I ate an apple for lunch.” (Apple is a common noun, not a named entity.)
- “Apple Street is closed.” (Apple is a location.)
Named entities often consist of multiple words, making boundary detection another challenge. Entity names can be complex, such as:
- Corporate entities: “Bank of America Corporation”
- Product names: “iPhone 14 Pro Max”
- Person names: “Martin Luther King Jr.”
Additionally, language is dynamic and continuously evolving. Instead of memorizing what qualifies as an entity, models must deduce it from context. Language evolution introduces new entities, such as emerging companies, new products, and newly coined terms.
Now, let’s explore how state-of-the-art NER models address these challenges.
The Evolution of NER Technology
The evolution of NER technology reflects the broader advancement of natural language processing. Early approaches relied on rule-based systems and pattern matching—defining grammatical patterns, identifying capitalization, and using contextual markers (e.g., “the” before a proper noun). However, these rules were often numerous, inconsistent, and difficult to scale.
To improve accuracy, researchers introduced statistical approaches, leveraging probability-based models such as Hidden Markov Models (HMMs) and Conditional Random Fields (CRFs) to identify named entities.
With the rise of deep learning, neural networks became the preferred method for NER. Initially, bidirectional LSTM networks showed promise. However, the introduction of attention mechanisms and transformer-based models proved to be even more effective.
BERT’s Revolutionary Approach to NER
BERT (Bidirectional Encoder Representations from Transformers) has fundamentally transformed NER with several key innovations:
Contextual Understanding
Unlike traditional models that process text in one direction, BERT’s bidirectional nature allows it to consider both preceding and following text. This enables it to capture long-range dependencies, understand subtle contextual nuances, and handle ambiguous cases more effectively.
Tokenization and Subword Units
While not exclusive to BERT, its subword tokenization strategy allows it to handle unknown words while preserving morphological information. This reduces vocabulary size and makes the model adaptable across different languages and domains.
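As a quick illustration, the snippet below runs a BERT WordPiece tokenizer on a longer word; the exact split depends on the checkpoint's vocabulary, so treat the pieces shown in the comment as indicative:

```python
from transformers import AutoTokenizer

# Any BERT-style checkpoint works here; this is the one used later in the post
tokenizer = AutoTokenizer.from_pretrained("dbmdz/bert-large-cased-finetuned-conll03-english")

# Rare or unseen words are broken into known subword pieces, marked with "##"
print(tokenizer.tokenize("Snowboarding"))   # e.g. ['Snow', '##board', '##ing']
```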
The IOB Tagging Mechanism
NER results can be represented in various ways, but BERT uses the Inside-Outside-Beginning (IOB) tagging scheme:
- B marks the beginning of an entity.
- I indicates the continuation of an entity.
- O signifies non-entities.
This scheme enables BERT to delimit multi-word entities precisely and to mark where one entity ends and the next begins.
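To make the scheme concrete, here is how a short sentence would be tagged under IOB, using the CoNLL-03 label names (the exact tag set depends on the data the model was trained on):

```
Tim    → B-PER   (beginning of a person entity)
Cook   → I-PER   (continuation of the same person)
works  → O       (not an entity)
at     → O
Apple  → B-ORG   (beginning of an organization entity)
```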
Using DistilBERT with Hugging Face’s Pipeline
The easiest way to perform NER is by using Hugging Face’s pipeline
API, which abstracts away much of the complexity while still delivering powerful results. Here’s an example:
```python
from transformers import pipeline

# Initialize the NER pipeline
ner_pipeline = pipeline("ner",
                        model="dbmdz/bert-large-cased-finetuned-conll03-english",
                        aggregation_strategy="simple")

# Text example
text = "Apple CEO Tim Cook announced new iPhone models in California yesterday."

# Perform NER
entities = ner_pipeline(text)

# Print the results
for entity in entities:
    print(f"Entity: {entity['word']}")
    print(f"Type: {entity['entity_group']}")
    print(f"Confidence: {entity['score']:.4f}")
    print("-" * 30)
```
Now, let’s break down this code in detail. First, you initialize the pipeline:
```python
ner_pipeline = pipeline("ner",
                        model="dbmdz/bert-large-cased-finetuned-conll03-english",
                        aggregation_strategy="simple")
```
The pipeline()
function creates a ready-to-use NER pipeline. This is essential because BERT cannot work on raw strings: the text must be preprocessed into tensors before the model can process it, and the model's output must then be converted into a usable format. A pipeline handles both steps automatically.
The argument "ner" specifies that you want Named Entity Recognition, while model="dbmdz/bert-large-cased-finetuned-conll03-english" loads a pre-trained model fine-tuned specifically for NER. The final argument, aggregation_strategy="simple", merges subwords back into complete words, making the output more readable.
The pipeline above returns a list of dictionaries, where each dictionary contains:
- word: The detected entity text
- entity_group: The type of entity (e.g., PER for person, ORG for organization)
- score: Confidence score between 0 and 1
- start and end: Character positions of the entity in the original text
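For instance, the start and end offsets let you recover the exact span from the original string, which is handy for highlighting or downstream post-processing:

```python
for entity in entities:
    span = text[entity["start"]:entity["end"]]
    print(f"{span} -> {entity['entity_group']}")
```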
Running the pipeline example above produces the following output:
```
Entity: Apple
Type: ORG
Confidence: 0.9987
------------------------------
Entity: Tim Cook
Type: PER
Confidence: 0.9956
------------------------------
Entity: California
Type: LOC
Confidence: 0.9934
------------------------------
```
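To see what aggregation_strategy="simple" is doing for you, you can omit the argument, in which case the pipeline reports one record per token, with unmerged subword pieces and token-level IOB tags. A minimal sketch (in the un-aggregated output, each record carries an entity key rather than entity_group):

```python
from transformers import pipeline

# Same model, but without aggregation: raw token-level predictions
raw_pipeline = pipeline("ner", model="dbmdz/bert-large-cased-finetuned-conll03-english")

for item in raw_pipeline("Apple CEO Tim Cook announced new iPhone models."):
    # Each record is a single token with a tag such as B-ORG or I-PER
    print(item["word"], item["entity"], round(item["score"], 4))
```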
Using DistilBERT Explicitly with AutoModelForTokenClassification
For greater control over the NER process, you can bypass the pipeline and work directly with the model and tokenizer. This approach provides more flexibility and insight into the process. Here’s an example:
```python
from transformers import AutoTokenizer, AutoModelForTokenClassification
import torch

# Load model and tokenizer
model_name = "dbmdz/bert-large-cased-finetuned-conll03-english"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForTokenClassification.from_pretrained(model_name)

# Text example
text = "Google and Microsoft are competing in the AI space while Elon Musk founded SpaceX."

# Tokenize the text
inputs = tokenizer(text, return_tensors="pt", add_special_tokens=True)

# Get predictions
with torch.no_grad():
    outputs = model(**inputs)
    predictions = torch.argmax(outputs.logits, dim=2)

# Convert predictions to labels
label_list = model.config.id2label
tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
predictions = predictions[0].tolist()

# Process results
current_entity = []
current_entity_type = None

for token, prediction in zip(tokens, predictions):
    if token.startswith("##"):
        if current_entity:
            current_entity.append(token[2:])
    else:
        if current_entity:
            print(f"Entity: {''.join(current_entity)}")
            print(f"Type: {current_entity_type}")
            print("-" * 30)
            current_entity = []

        if label_list[prediction] != "O":
            current_entity = [token]
            current_entity_type = label_list[prediction]

# Print final entity if exists
if current_entity:
    print(f"Entity: {''.join(current_entity)}")
    print(f"Type: {current_entity_type}")
```
This implementation is more detailed. Let’s walk through it step by step. First, you load the model and tokenizer:
```python
model_name = "dbmdz/bert-large-cased-finetuned-conll03-english"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForTokenClassification.from_pretrained(model_name)
```
The AutoTokenizer
class automatically selects the appropriate tokenizer based on the model's configuration, ensuring compatibility. Tokenizers are responsible for transforming input text into tokens. AutoModelForTokenClassification
loads a model fine-tuned for token classification tasks, including both the model architecture and pre-trained weights.
Next, you preprocess the input text:
```python
inputs = tokenizer(text, return_tensors="pt", add_special_tokens=True)
```
This step converts the text into token IDs that the model can process. A token is typically a word but can also be a subword; for example, an uncommon word such as “subword” may be split into pieces like “sub” and “##word” even though it appears as a single word in the text. The return_tensors="pt"
argument returns the sequence as PyTorch tensors, while add_special_tokens=True
ensures the inclusion of [CLS]
and [SEP]
tokens at the beginning and end of the sequence, which BERT requires.
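If you want to inspect what the tokenizer produced, you can convert the IDs back into tokens; the exact subword pieces depend on the checkpoint's vocabulary, so the comment below is only indicative:

```python
print(tokenizer.convert_ids_to_tokens(inputs["input_ids"][0]))
# e.g. ['[CLS]', 'Google', 'and', 'Microsoft', 'are', ..., '[SEP]']
```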
Then, you run the model on the input tensor:
```python
with torch.no_grad():
    outputs = model(**inputs)
    predictions = torch.argmax(outputs.logits, dim=2)
```
Using torch.no_grad()
disables gradient calculation during inference, saving both time and memory. The function torch.argmax(outputs.logits, dim=2)
selects the most likely label for each token. The result, predictions, is a tensor of integer label indices, one per token.
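Checking the tensor shapes makes the argmax step easier to follow; the sequence length below is whatever the tokenizer produced for this sentence, and the label count depends on the checkpoint (CoNLL-03 models typically use nine tags), so treat the numbers in the comments as indicative:

```python
print(outputs.logits.shape)   # (batch size, sequence length, number of labels), e.g. torch.Size([1, 20, 9])
print(predictions.shape)      # (batch size, sequence length): one label index per token
```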
To convert the model’s output into human-readable text, we prepare a mapping between prediction indices and actual entity labels:
```python
label_list = model.config.id2label
tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
predictions = predictions[0].tolist()
```
The dictionary model.config.id2label maps prediction indices to actual entity labels. The function convert_ids_to_tokens converts integer token IDs back into readable text. Since the model was run on a single piece of input text, only one output sequence is expected, so you take the first row of predictions and convert it to a Python list for easier processing.
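Printing the mapping shows exactly which tags this checkpoint predicts; the ordering in the comment below is indicative, since it varies between checkpoints:

```python
print(label_list)
# e.g. {0: 'O', 1: 'B-PER', 2: 'I-PER', 3: 'B-ORG', 4: 'I-ORG',
#       5: 'B-LOC', 6: 'I-LOC', 7: 'B-MISC', 8: 'I-MISC'}
```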
Finally, you reconstruct the entity predictions using a loop. Since BERT’s tokenizer sometimes splits words into subwords (indicated by "##"
), you merge them back into complete words. The entity type is determined using the label_list
dictionary.
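To picture what the loop does, you can trace the same merging logic over a hypothetical token/label sequence (the actual WordPiece split produced by the tokenizer may differ):

```python
# Hypothetical tokens and labels for "Elon Musk founded SpaceX."
tokens = ["[CLS]", "El", "##on", "Mus", "##k", "founded", "Space", "##X", ".", "[SEP]"]
labels = ["O", "B-PER", "I-PER", "I-PER", "I-PER", "O", "B-ORG", "I-ORG", "O", "O"]

merged, current = [], []
for token, label in zip(tokens, labels):
    if token.startswith("##") and current:
        current.append(token[2:])            # glue the subword back onto its word
    else:
        if current:
            merged.append("".join(current))  # a word boundary: flush the pending entity
        current = [token] if label != "O" else []
if current:
    merged.append("".join(current))

print(merged)   # ['Elon', 'Musk', 'SpaceX'] -- note that separate words are not joined
```

As the output suggests, this simple loop reports "Elon" and "Musk" as separate entities; merging whole multi-word entities is exactly what aggregation_strategy="simple" did for you in the pipeline example.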
Best Practices for NER Implementation
Performing Named Entity Recognition (NER) is as simple as shown above. However, you are not required to use the exact code provided. Specifically, you can switch between different models (along with the corresponding tokenizer). If you need faster processing, consider using a DistilBERT model. If accuracy is a priority, opt for a larger BERT or RoBERTa model. Additionally, if your input requires domain-specific knowledge, you may benefit from using a domain-adapted model.
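For example, swapping in a distilled checkpoint is just a matter of changing the model name. The checkpoint below is a community DistilBERT model fine-tuned on CoNLL-03 published on the Hugging Face Hub; verify that it (or whichever model you choose) exists and fits your data before relying on it:

```python
from transformers import pipeline

# A smaller, faster alternative to the bert-large checkpoint used above
fast_ner = pipeline("ner",
                    model="elastic/distilbert-base-cased-finetuned-conll03-english",
                    aggregation_strategy="simple")

print(fast_ner("Tim Cook visited Berlin last week."))
```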
If you need to process a large volume of text for NER, you can improve efficiency by processing inputs in batches. Other techniques, such as using a GPU for acceleration or caching results for frequently accessed texts, can further enhance performance.
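With the pipeline API, batching amounts to passing a list of texts together with a batch_size; here is a minimal sketch, where the batch size and device index are illustrative and should be tuned to your hardware:

```python
from transformers import pipeline

# device=0 selects the first GPU; use device=-1 (the default) to stay on CPU
ner_pipeline = pipeline("ner",
                        model="dbmdz/bert-large-cased-finetuned-conll03-english",
                        aggregation_strategy="simple",
                        device=0)

texts = [
    "Amazon opened a new office in Toronto.",
    "Angela Merkel met Emmanuel Macron in Paris.",
    "The Raspberry Pi Foundation is based in Cambridge.",
]

# Passing a list lets the pipeline batch the inputs internally
for doc_entities in ner_pipeline(texts, batch_size=8):
    print(doc_entities)
```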
In a production system, proper error-handling logic should also be implemented. This includes validating input, handling edge cases such as empty strings and special characters, and addressing other potential issues.
Here’s a complete example incorporating these best practices:
```python
from transformers import pipeline
import torch
import logging
from typing import List, Dict

class NERProcessor:
    def __init__(self,
                 model_name: str = "dbmdz/bert-large-cased-finetuned-conll03-english",
                 confidence_threshold: float = 0.8):
        self.confidence_threshold = confidence_threshold
        try:
            self.device = "cuda" if torch.cuda.is_available() else "cpu"
            self.ner_pipeline = pipeline("ner",
                                         model=model_name,
                                         aggregation_strategy="simple",
                                         device=self.device)
        except Exception as e:
            logging.error(f"Failed to initialize NER pipeline: {str(e)}")
            raise

    def process_text(self, text: str) -> List[Dict]:
        if not text or not isinstance(text, str):
            logging.warning("Invalid input text")
            return []

        try:
            # Get predictions
            entities = self.ner_pipeline(text)

            # Post-process results
            filtered_entities = [
                entity for entity in entities
                if entity["score"] >= self.confidence_threshold
            ]

            return filtered_entities
        except Exception as e:
            logging.error(f"Error processing text: {str(e)}")
            return []

if __name__ == "__main__":
    # Initialize processor
    processor = NERProcessor()

    # Text example
    text = """
    Apple Inc. CEO Tim Cook announced new partnerships with Microsoft
    and Google during a conference in New York City. The event was also
    attended by Sundar Pichai and Satya Nadella.
    """

    # Process text
    results = processor.process_text(text)

    # Print results
    for entity in results:
        print(f"Entity: {entity['word']}")
        print(f"Type: {entity['entity_group']}")
        print(f"Confidence: {entity['score']:.4f}")
        print("-" * 30)
```
Summary
Named Entity Recognition with BERT models provides a powerful way to extract structured information from text. The Hugging Face Transformers library makes it easy to implement NER with state-of-the-art models, whether you need a simple pipeline approach or more detailed control over the process.
In this tutorial, you learned about NER with BERT. In particular, you learned how to:
- Use the pipeline API for quick prototypes and simple applications
- Use explicit model handling for more control and custom processing
- Consider performance optimization for production applications
- Handle edge cases and implement proper error handling
With these tools and techniques, you can build robust NER systems for various applications, from information extraction to document processing and more.