How to Use LayoutLM for Document Understanding and Information Extraction with Hugging Face Transformers


Let’s learn how to use LayoutLM with Hugging Face Transformers.

Preparation

This tutorial uses the following packages, which you can install with:

pip install transformers datasets pillow matplotlib

You also need PyTorch; install the build that matches your environment (CPU-only or a specific CUDA version).
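Before moving on, it can help to confirm that everything is importable. The snippet below is a minimal sketch that uses only the standard library; the helper name `missing_packages` is made up for this illustration. Note that it checks import names rather than pip names, so Pillow appears as `PIL`.

```python
from importlib.util import find_spec


def missing_packages(import_names):
    """Return the import names that cannot be found in the current environment."""
    return [name for name in import_names if find_spec(name) is None]


# Import names, not pip names: Pillow is imported as "PIL".
# matplotlib is included because the plotting step later in the tutorial uses it.
required = ["transformers", "datasets", "PIL", "matplotlib", "torch"]
print("Missing packages:", missing_packages(required) or "none")
```

If the list is not empty, install the missing packages before continuing.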

With the packages installed, let’s move on to the next part.

LayoutLM with Hugging Face Transformers

LayoutLM is a model designed for document understanding that combines the text of a document with its layout. Alongside each token, it receives the token’s bounding box on the page, so it learns where text sits as well as what it says. This makes it well suited to extracting information from documents with a defined format, such as forms, invoices, and receipts.

Let’s begin working with LayoutLM using sample data. This tutorial uses the FUNSD dataset, which contains forms annotated for Named Entity Recognition (NER) with categories such as HEADER, QUESTION, and ANSWER, along with bounding box information for each word.

from datasets import load_dataset

dataset = load_dataset("nielsr/funsd")
example = dataset["train"][1]
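LayoutLM expects each bounding box as (x0, y0, x1, y1) with coordinates scaled to the 0-1000 range. The boxes in this dataset appear to be in that scale already, which is why the drawing code later in this tutorial divides by 1000. For your own documents with pixel coordinates, a conversion could look like the sketch below; the function name `normalize_box` and the page size are assumptions for illustration.

```python
def normalize_box(box, width, height):
    """Scale a pixel-space (x0, y0, x1, y1) box to the 0-1000 range."""
    x0, y0, x1, y1 = box
    return [
        int(1000 * x0 / width),
        int(1000 * y0 / height),
        int(1000 * x1 / width),
        int(1000 * y1 / height),
    ]


# Hypothetical word box on a hypothetical 800x1000 pixel page.
print(normalize_box((80, 100, 400, 150), width=800, height=1000))
# [100, 100, 500, 150]
```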

Next, we load the LayoutLM tokenizer and preprocess the data.

from transformers import LayoutLMTokenizerFast

tokenizer = LayoutLMTokenizerFast.from_pretrained("microsoft/layoutlm-base-uncased")


def preprocess_example(example):
    # Tokenize the pre-split words and pad/truncate to LayoutLM's 512-token limit
    encoding = tokenizer(
        example['words'],
        is_split_into_words=True,
        padding="max_length",
        truncation=True,
        max_length=512
    )

    # Expand the word-level labels and boxes to token level
    labels = []
    boxes = []
    for word_id in encoding.word_ids():
        if word_id is None:
            labels.append(-100)  # Special and padding tokens get a label of -100
            boxes.append([0, 0, 0, 0])
        else:
            # Every sub-token inherits the label and box of its source word
            labels.append(example['ner_tags'][word_id])
            boxes.append(example['bboxes'][word_id])

    encoding['labels'] = labels
    encoding['bbox'] = boxes

    return encoding


encoding = preprocess_example(example)
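To see what the alignment step does, here is a small sketch of the same logic on toy values, with no tokenizer involved. The `word_ids` list mimics what `encoding.word_ids()` returns: `None` for special tokens and a word index for every sub-token. The function name `align_to_tokens`, the labels, the boxes, and the sub-token split are all made up for illustration.

```python
def align_to_tokens(word_ids, word_labels, word_boxes, ignore_index=-100):
    """Expand word-level labels and boxes to one entry per token."""
    labels, boxes = [], []
    for word_id in word_ids:
        if word_id is None:
            # Special or padding token: ignored by the loss, dummy box
            labels.append(ignore_index)
            boxes.append([0, 0, 0, 0])
        else:
            labels.append(word_labels[word_id])
            boxes.append(word_boxes[word_id])
    return labels, boxes


# Toy input: two words, the second one hypothetically split into two sub-tokens.
word_ids = [None, 0, 1, 1, None]
word_labels = [3, 5]
word_boxes = [[10, 10, 50, 20], [60, 10, 120, 20]]

labels, boxes = align_to_tokens(word_ids, word_labels, word_boxes)
print(labels)  # [-100, 3, 5, 5, -100]
print(boxes[2] == boxes[3])  # True: both sub-tokens share the word's box
```

Labeling every sub-token is one common choice; another is to label only the first sub-token of each word and set the rest to -100.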

Next, we load the LayoutLM model with a token classification head using the code below.

from transformers import LayoutLMForTokenClassification
import torch

model = LayoutLMForTokenClassification.from_pretrained("microsoft/layoutlm-base-uncased", num_labels=len(dataset["train"].features["ner_tags"].feature.names))

# Move the model to GPU if available
device = "cuda" if torch.cuda.is_available() else "cpu"
model.to(device)

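Once we have the LayoutLM model, we can run it on the encoded sample to examine the predicted NER tags. Note that the base checkpoint does not include a trained token classification head, so the head is randomly initialized; the predictions only become meaningful after fine-tuning the model on FUNSD.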

import torch

input_ids = torch.tensor(encoding["input_ids"]).unsqueeze(0).to(device)
attention_mask = torch.tensor(encoding["attention_mask"]).unsqueeze(0).to(device)
bbox = torch.tensor(encoding["bbox"]).unsqueeze(0).to(device)
labels = torch.tensor(encoding["labels"]).unsqueeze(0).to(device)

with torch.no_grad():
    outputs = model(input_ids=input_ids, attention_mask=attention_mask, bbox=bbox, labels=labels)
    logits = outputs.logits

predicted_labels = torch.argmax(logits, dim=2)
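Because `labels` is passed to the model, `outputs.loss` is available as well. To get a rough sense of how the predictions compare with the annotations, one option is token-level accuracy that skips the positions labeled -100. The sketch below uses plain Python lists and toy values; the function name `token_accuracy` is an assumption, and with the real data you could pass something like `predicted_labels[0].tolist()` and `encoding["labels"]`.

```python
def token_accuracy(predicted, gold, ignore_index=-100):
    """Fraction of correct predictions over positions that carry a real label."""
    pairs = [(p, g) for p, g in zip(predicted, gold) if g != ignore_index]
    if not pairs:
        return 0.0
    return sum(p == g for p, g in pairs) / len(pairs)


# Toy values: the first and last positions are special tokens and are ignored.
predicted = [0, 3, 5, 1, 0]
gold = [-100, 3, 5, 5, -100]
print(f"Token accuracy: {token_accuracy(predicted, gold):.2f}")
```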

This gives the predicted label IDs, which are not very readable on their own. So, we can decode the predictions to get the label names.

label_map = {i: label for i, label in enumerate(dataset["train"].features["ner_tags"].feature.names)}
predicted_labels = predicted_labels.cpu().numpy()[0]

decoded_labels = [label_map[label_id] for label_id in predicted_labels]
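Keep in mind that `decoded_labels` has one entry per token (512 after padding), not one per word. If you need word-level predictions, a simple approach is to keep the prediction of the first sub-token of each word. The sketch below shows that idea on toy values; the function name `first_subtoken_predictions` is hypothetical, and with the real data `encoding.word_ids()` and `decoded_labels` would be the inputs.

```python
def first_subtoken_predictions(word_ids, token_predictions):
    """Keep one prediction per word: the one made for its first sub-token."""
    word_predictions = []
    previous_word_id = None
    for word_id, prediction in zip(word_ids, token_predictions):
        # Skip special tokens and any sub-token after the first one of a word
        if word_id is not None and word_id != previous_word_id:
            word_predictions.append(prediction)
        previous_word_id = word_id
    return word_predictions


# Toy values: three words, the second one split into two sub-tokens.
word_ids = [None, 0, 1, 1, 2, None, None]
token_predictions = ["O", "B-QUESTION", "B-ANSWER", "I-ANSWER", "I-ANSWER", "O", "O"]
print(first_subtoken_predictions(word_ids, token_predictions))
# ['B-QUESTION', 'B-ANSWER', 'I-ANSWER']
```

Words that were cut off by truncation have no tokens and therefore get no prediction.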

Lastly, we can draw the predictions from LayoutLM on the document image that the sample comes from.

from PIL import Image, ImageDraw, ImageFont
import matplotlib.pyplot as plt

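# Convert to RGB so the colored outlines render correctly even if the scan is grayscale
image = Image.open(example["image_path"]).convert("RGB")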

draw = ImageDraw.Draw(image)
colors = {
    "I-HEADER": "blue",
    "I-QUESTION": "green",
    "I-ANSWER": "red",
    "B-HEADER": "yellow",
    "B-QUESTION": "purple",
    "B-ANSWER": "orange",
    "O": "white"
}

image_width, image_height = image.size
font = ImageFont.load_default()

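# decoded_labels has one entry per token, so pair it with the token-level boxes
for box, label, gold in zip(encoding["bbox"], decoded_labels, encoding["labels"]):
    if gold != -100 and label != "O":  # Skip special/padding tokens and "O" predictions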
        color = colors.get(label, "blue")
        scaled_box = [
            box[0] * image_width / 1000,
            box[1] * image_height / 1000,
            box[2] * image_width / 1000,
            box[3] * image_height / 1000
        ]
        draw.rectangle(scaled_box, outline=color, width=2)
        draw.text((scaled_box[0], scaled_box[1] - 10), label, fill=color, font=font)

plt.figure(figsize=(12, 12))
plt.imshow(image)
plt.axis('off')
plt.show()

LayoutLM predicts an NER tag for each token, and the code above draws those tags on the corresponding bounding boxes in the image. Try experimenting with this model to understand and extract information from your own documents.
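To turn tagged words into extracted fields, the B-/I- tags can be merged into entity spans. The following is a minimal sketch of that grouping on toy values; the function name `group_entities`, the words, and the tags are invented for illustration and are not output from the model.

```python
def group_entities(words, tags):
    """Merge consecutive B-/I- tagged words into (entity_type, text) spans."""
    entities = []
    current_type, current_words = None, []
    for word, tag in zip(words, tags):
        prefix, _, entity_type = tag.partition("-")
        # A "B-" tag, or an "I-" tag of a different type, starts a new span
        starts_new = prefix == "B" or (prefix == "I" and entity_type != current_type)
        if tag == "O" or starts_new:
            if current_words:
                entities.append((current_type, " ".join(current_words)))
            current_type, current_words = None, []
        if tag != "O":
            current_type = entity_type
            current_words.append(word)
    if current_words:
        entities.append((current_type, " ".join(current_words)))
    return entities


# Toy form content with word-level tags.
words = ["Date", ":", "12/05/1998", "Signed", "Name", "John", "Smith"]
tags = ["B-QUESTION", "I-QUESTION", "B-ANSWER", "O", "B-QUESTION", "B-ANSWER", "I-ANSWER"]
for entity_type, text in group_entities(words, tags):
    print(f"{entity_type}: {text}")
```

Pairing each QUESTION span with the ANSWER span that follows it is a simple heuristic for building key-value pairs, though real forms often need layout-aware linking.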


Cornellius Yudha Wijaya is a data science assistant manager and data writer. While working full-time at Allianz Indonesia, he loves to share Python and data tips via social media and writing media. Cornellius writes on a variety of AI and machine learning topics.
