Let's learn how to use LayoutLM with Hugging Face Transformers.
Preparation
This tutorial uses the following packages, which you can install with pip:
pip install transformers datasets pillow matplotlib
You also need PyTorch; install the build that matches your environment (operating system, Python version, and CPU or CUDA support).
With the packages installed, we can move on to the next part.
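Before moving on, it can help to confirm that everything is importable. The snippet below is a small sketch that uses only the standard library; the `REQUIRED` list and the `check_packages` helper are names introduced here for illustration and are not part of any of the libraries.

```python
import importlib.util

# Import names of the packages used in this tutorial (pillow is imported as PIL)
REQUIRED = ["transformers", "datasets", "PIL", "torch", "matplotlib"]

def check_packages(names):
    """Return a dict mapping each import name to True if it can be found."""
    return {name: importlib.util.find_spec(name) is not None for name in names}

# Report the status of every required package
for name, found in check_packages(REQUIRED).items():
    print(f"{name}: {'ok' if found else 'missing'}")
```

Any package reported as missing can be installed with pip before continuing.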
LayoutLM with Hugging Face Transformers
LayoutLM is a model designed for document understanding that integrates a document's text with its visual layout. Alongside each token, it takes the token's bounding box on the page, so it learns from both what the text says and where it appears. This makes it well suited to extracting information from documents with a defined format, such as forms, invoices, and receipts.
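To make the layout input concrete: LayoutLM expects each bounding box as `[x0, y0, x1, y1]` on a 0 to 1000 scale, independent of the page's pixel size. The sketch below shows one way to convert pixel coordinates to that scale and back; `normalize_box` and `denormalize_box` are hypothetical helper names, and the page size and box in the usage lines are made up.

```python
def normalize_box(box, width, height):
    """Scale a pixel-space [x0, y0, x1, y1] box to the 0-1000 range."""
    x0, y0, x1, y1 = box
    return [
        int(1000 * x0 / width),
        int(1000 * y0 / height),
        int(1000 * x1 / width),
        int(1000 * y1 / height),
    ]

def denormalize_box(box, width, height):
    """Scale a 0-1000 box back to pixel coordinates for drawing."""
    x0, y0, x1, y1 = box
    return [
        x0 * width / 1000,
        y0 * height / 1000,
        x1 * width / 1000,
        y1 * height / 1000,
    ]

# Made-up page of 800 x 1000 pixels with one word box
page_width, page_height = 800, 1000
pixel_box = [80, 100, 400, 150]

norm = normalize_box(pixel_box, page_width, page_height)
print(norm)  # [100, 100, 500, 150]
print(denormalize_box(norm, page_width, page_height))  # [80.0, 100.0, 400.0, 150.0]
```

The FUNSD boxes used in this tutorial are already on the 0 to 1000 scale, so only the reverse conversion is needed later when drawing on the image.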
Let's begin working with LayoutLM using sample data. This tutorial uses the FUNSD dataset, which contains forms annotated for Named Entity Recognition (NER) with categories such as HEADER, QUESTION, and ANSWER, along with bounding box information for each word.
from datasets import load_dataset
dataset = load_dataset("nielsr/funsd")
example = dataset["train"][1]
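Next, we download the LayoutLM tokenizer and preprocess the data. The tokenizer splits each word into sub-word tokens, so every token needs to inherit the label and bounding box of the word it came from.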
from transformers import LayoutLMTokenizerFast

tokenizer = LayoutLMTokenizerFast.from_pretrained("microsoft/layoutlm-base-uncased")

def preprocess_example(example):
    # Tokenize the pre-split words, padding or truncating to 512 tokens
    encoding = tokenizer(
        example['words'],
        is_split_into_words=True,
        return_offsets_mapping=True,
        padding="max_length",
        truncation=True,
        max_length=512
    )
    labels = []
    boxes = []
    # word_ids() maps each token back to the index of its source word
    for word_id in encoding.word_ids():
        if word_id is None:
            labels.append(-100)  # Special and padding tokens get a label of -100
            boxes.append([0, 0, 0, 0])
        else:
            labels.append(example['ner_tags'][word_id])
            boxes.append(example['bboxes'][word_id])
    encoding['labels'] = labels
    encoding['bbox'] = boxes
    return encoding

encoding = preprocess_example(example)
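The alignment step inside `preprocess_example` can be seen in isolation with a hand-written `word_ids` list, without downloading a tokenizer. In this sketch, `align_to_tokens` is a hypothetical helper and the tags and boxes are made-up values; the token split shown is illustrative rather than the tokenizer's real output.

```python
def align_to_tokens(word_ids, word_labels, word_boxes, ignore_index=-100):
    """Expand word-level labels and boxes to token level using word_ids."""
    labels, boxes = [], []
    for word_id in word_ids:
        if word_id is None:
            # Special or padding token: no source word
            labels.append(ignore_index)
            boxes.append([0, 0, 0, 0])
        else:
            # Every sub-word token inherits its word's label and box
            labels.append(word_labels[word_id])
            boxes.append(word_boxes[word_id])
    return labels, boxes

# Two made-up words; the second is assumed to split into two sub-word tokens
word_labels = [3, 5]
word_boxes = [[100, 50, 180, 70], [190, 50, 300, 70]]
word_ids = [None, 0, 1, 1, None]  # [CLS], word 0, word 1 (twice), [SEP]

token_labels, token_boxes = align_to_tokens(word_ids, word_labels, word_boxes)
print(token_labels)  # [-100, 3, 5, 5, -100]
print(token_boxes[2] == token_boxes[3])  # True
```

The value -100 is the default `ignore_index` of PyTorch's cross-entropy loss, so those positions do not contribute to the loss during training.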
Next, we will download the LayoutLM model using the code below.
from transformers import LayoutLMForTokenClassification
import torch

# One output class per NER tag in the dataset
num_labels = len(dataset["train"].features["ner_tags"].feature.names)
model = LayoutLMForTokenClassification.from_pretrained(
    "microsoft/layoutlm-base-uncased", num_labels=num_labels
)
# Move the model to GPU if available
device = "cuda" if torch.cuda.is_available() else "cpu"
model.to(device)
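Once we have the LayoutLM model, we can run it on the encoded sample to examine the predicted NER tags. Note that the token classification head on top of the base checkpoint is newly initialized, so the predictions only become meaningful after fine-tuning.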
# Add a batch dimension to each input and move it to the model's device
input_ids = torch.tensor(encoding["input_ids"]).unsqueeze(0).to(device)
attention_mask = torch.tensor(encoding["attention_mask"]).unsqueeze(0).to(device)
bbox = torch.tensor(encoding["bbox"]).unsqueeze(0).to(device)
labels = torch.tensor(encoding["labels"]).unsqueeze(0).to(device)
model.eval()  # Disable dropout for inference
with torch.no_grad():
    outputs = model(input_ids=input_ids, attention_mask=attention_mask, bbox=bbox, labels=labels)
logits = outputs.logits
# Pick the highest-scoring label for each token
predicted_labels = torch.argmax(logits, dim=2)
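When comparing predictions against the labels, the positions labelled -100 (special and padding tokens) should be skipped. The following is a minimal sketch on plain Python lists; `token_accuracy` is a hypothetical helper and the numbers are made up. With real tensors you would first convert them, for example with `.tolist()`.

```python
def token_accuracy(predicted, labels, ignore_index=-100):
    """Fraction of correctly predicted tokens, skipping ignored positions."""
    pairs = [(p, y) for p, y in zip(predicted, labels) if y != ignore_index]
    if not pairs:
        return 0.0  # Nothing to score
    correct = sum(1 for p, y in pairs if p == y)
    return correct / len(pairs)

# Made-up predictions and labels for a six-token sequence
predicted = [0, 3, 4, 4, 0, 0]
labels = [-100, 3, 4, 5, -100, -100]

# Only the three positions with a real label are scored; two of them match
print(token_accuracy(predicted, labels))
```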
The predictions are integer label IDs, which are not easy to read. We can decode them into label names.
# Map each label ID to its name
label_map = {i: label for i, label in enumerate(dataset["train"].features["ner_tags"].feature.names)}
predicted_labels = predicted_labels.cpu().numpy()[0]
decoded_labels = [label_map[label_id] for label_id in predicted_labels]
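The decoded tags follow the B-/I-/O scheme visible in the label names, where `B-` starts an entity and `I-` continues it. One common way to make them easier to read is to merge consecutive words into entity spans. The sketch below assumes one label per word (not per token); `group_entities` is a hypothetical helper and the sample words are made up.

```python
def group_entities(words, labels):
    """Merge word-level B-/I- tags into (entity_type, text) spans."""
    entities = []
    current_type, current_words = None, []
    for word, label in zip(words, labels):
        prefix, _, entity_type = label.partition("-")
        if prefix == "I" and entity_type == current_type:
            current_words.append(word)  # Continue the open entity
            continue
        # Any other tag closes the open entity, if there is one
        if current_type is not None:
            entities.append((current_type, " ".join(current_words)))
        if prefix in ("B", "I"):
            # B- starts a new entity; a stray I- is treated as a start too
            current_type, current_words = entity_type, [word]
        else:
            current_type, current_words = None, []
    # Flush the last open entity
    if current_type is not None:
        entities.append((current_type, " ".join(current_words)))
    return entities

words = ["Date", "of", "birth:", "12", "May", "signed"]
labels = ["B-QUESTION", "I-QUESTION", "I-QUESTION", "B-ANSWER", "I-ANSWER", "O"]
print(group_entities(words, labels))
# [('QUESTION', 'Date of birth:'), ('ANSWER', '12 May')]
```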
Lastly, we can draw the LayoutLM predictions on the document image that the sample comes from.
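from PIL import Image, ImageDraw, ImageFont
import matplotlib.pyplot as plt

# Convert to RGB so the colored outlines also render on grayscale scans
image = Image.open(example["image_path"]).convert("RGB")
draw = ImageDraw.Draw(image)
colors = {
    "I-HEADER": "blue",
    "I-QUESTION": "green",
    "I-ANSWER": "red",
    "B-HEADER": "yellow",
    "B-QUESTION": "purple",
    "B-ANSWER": "orange",
    "O": "white"
}
image_width, image_height = image.size
font = ImageFont.load_default()

# decoded_labels has one entry per token, while bboxes has one entry per word,
# so keep the prediction of the first token of each word
word_labels = {}
for token_idx, word_id in enumerate(encoding.word_ids()):
    if word_id is not None and word_id not in word_labels:
        word_labels[word_id] = decoded_labels[token_idx]

for word_id, label in word_labels.items():
    if label != "O":
        color = colors.get(label, "blue")
        box = example["bboxes"][word_id]
        # Boxes are on a 0-1000 scale; convert them to pixel coordinates
        scaled_box = [
            box[0] * image_width / 1000,
            box[1] * image_height / 1000,
            box[2] * image_width / 1000,
            box[3] * image_height / 1000
        ]
        draw.rectangle(scaled_box, outline=color, width=2)
        draw.text((scaled_box[0], scaled_box[1] - 10), label, fill=color, font=font)

plt.figure(figsize=(12, 12))
plt.imshow(image)
plt.axis('off')
plt.show()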
LayoutLM assigns a predicted NER tag to each bounding box in the image. Try mastering this model to help you understand and extract information from your own documents.
Cornellius Yudha Wijaya is a data science assistant manager and data writer. While working full-time at Allianz Indonesia, he loves to share Python and data tips via social media and writing media. Cornellius writes on a variety of AI and machine learning topics.