Relation Extraction with Llama3 Models | by Silvia Onofrei | Apr, 2024

Enhanced relation extraction by fine-tuning Llama3–8B with a synthetic dataset created using Llama3–70B

Towards Data Science

Generated with DALL-E.

Relation extraction (RE) is the task of extracting relationships from unstructured text to identify connections between various named entities. It is done in conjunction with named entity recognition (NER) and is an essential step in a natural langage processing pipeline. With the rise of Large Language Models (LLMs), traditional supervised approaches that involve tagging entity spans and classifying relationships (if any) between them are enhanced or entirely replaced by LLM-based approaches [1].

Llama3 is the most recent major release in the domain of GenerativeAI [2]. The base model is available in two sizes, 8B and 70B, with a 400B model expected to be released soon. These models are available on the HuggingFace platform; see [3] for details. The 70B variant powers Meta’s new chat website and exhibits performance comparable to ChatGPT. The 8B model is among the most performant in its class. The architecture of Llama3 is similar to that of Llama2, with the increase in performance primarily due to data upgrading. The model comes with an upgaded tokenizer and expanded context window. It is labelled as open-source, although only a small percentage of the data is released. Overall, it is an excellent model, and I cannot wait to give it a try.

Llama3–70B can produce amazing results, but due to its size it is impractical, prohibitively expensive and hard to use on local systems. Therefore, to leverage its capabilities, we have Llama3–70B teach the smaller Llama3–8B the task of relation extraction from unstructured text.

Specifically, with the help of Llama3–70B, we build a supervised fine-tuning dataset aimed at relation extraction. We then use this dataset to fine-tune Llama3–8B to enhance its relation extraction capabilities.

To reproduce the code in the Google Colab Notebook associated to this blog, you will need:

  • HuggingFace credentials (to save the fine-tuned model, optional) and Llama3 access, which can be obtained by following the instructions from one of the models’ cards;
  • A free GroqCloud account (you can loggin with a Google account) and a corresponding API Key.

For this project I used a Google Colab Pro equipped with an A100 GPU and a High-RAM setting.

We start by installing all the required libraries:

!pip install -q groq
!pip install -U accelerate bitsandbytes datasets evaluate
!pip install -U peft transformers trl

I was very pleased to notice that the entire setup worked from the beginning without any dependencies issues or the need to install transformers from the source, despite the novelty of the model.

We also need to give access Goggle Colab to the drive and files and set the working directory:

# For Google Colab settings
from google.colab import userdata, drive

# This will prompt for authorization

# Set the working directory
%cd '/content/drive/MyDrive/postedBlogs/llama3RE'

For those who wish to upload the model to the HuggingFace Hub, we need to upload the Hub credentials. In my case, these are stored in Google Colab secrets, which can be accessed via the key button on the left. This step is optional.

# For Hugging Face Hub setting
from huggingface_hub import login

# Upload the HuggingFace token (should have WRITE access) from Colab secrets
HF = userdata.get('HF')

# This is needed to upload the model to HuggingFace

I also added some path variables to simplify file access:

# Create a path variable for the data folder
data_path = '/content/drive/MyDrive/postedBlogs/llama3RE/datas/'

# Full fine-tuning dataset
sft_dataset_file = f'{data_path}sft_train_data.json'

# Data collected from the the mini-test
mini_data_path = f'{data_path}mini_data.json'

# Test data containing all three outputs
all_tests_data = f'{data_path}all_tests.json'

# The adjusted training dataset
train_data_path = f'{data_path}sft_train_data.json'

# Create a path variable for the SFT model to be saved locally
sft_model_path = '/content/drive/MyDrive/llama3RE/Llama3_RE/'

Now that our workspace is set up, we can move to the first step, which is to build a synthetic dataset for the task of relation extraction.

There are several relation extraction datasets available, with the best-known being the CoNLL04 dataset. Additionally, there are excellent datasets such as web_nlg, available on HuggingFace, and SciREX developed by AllenAI. However, most of these datasets come with restrictive licenses.

Inspired by the format of the web_nlg dataset we will build our own dataset. This approach will be particularly useful if we plan to fine-tune a model trained on our dataset. To start, we need a collection of short sentences for our relation extraction task. We can compile this corpus in various ways.

Gather a Collection of Sentences

We will use databricks-dolly-15k, an open source dataset generated by Databricks employees in 2023. This dataset is designed for supervised fine-tuning and includes four features: instruction, context, response and category. After analyzing the eight categories, I decided to retain the first sentence of the context from the information_extraction category. The data parsing steps are outlined below:

from datasets import load_dataset

# Load the dataset
dataset = load_dataset("databricks/databricks-dolly-15k")

# Choose the desired category from the dataset
ie_category = [e for e in dataset["train"] if e["category"]=="information_extraction"]

# Retain only the context from each instance
ie_context = [e["context"] for e in ie_category]

# Split the text into sentences (at the period) and keep the first sentence
reduced_context = [text.split('.')[0] + '.' for text in ie_context]

# Retain sequences of specified lengths only (use character length)
sampler = [e for e in reduced_context if 30 < len(e) < 170]

The selection process yields a dataset comprising 1,041 sentences. Given that this is a mini-project, I did not handpick the sentences, and as a result, some samples may not be ideally suited for our task. In a project designated for production, I would carefully select only the most appropriate sentences. However, for the purposes of this project, this dataset will suffice.

Format the Data

We first need to create a system message that will define the input prompt and instruct the model on how to generate the answers:

system_message = """You are an experienced annontator. 
Extract all entities and the relations between them from the following text.
Write the answer as a triple entity1|relationship|entitity2.
Do not add anything else.
Example Text: Alice is from France.
Answer: Alice|is from|France.

Since this is an experimental phase, I am keeping the demands on the model to a minimum. I did test several other prompts, including some that requested outputs in CoNLL format where entities are categorized, and the model performed quite well. However, for simplicity’s sake, we’ll stick to the basics for now.

We also need to convert the data into a conversational format:

messages = [[
{"role": "system","content": f"{system_message}"},
{"role": "user", "content": e}] for e in sampler]

The Groq Client and API

Llama3 was released just a few days ago, and the availability of API options is still limited. While a chat interface is available for Llama3–70B, this project requires an API that could process my 1,000 sentences with a couple lines of code. I found this excellent YouTube video that explains how to use the GroqCloud API for free. For more details please refer to the video.

Just a reminder: you’ll need to log in and retrieve a free API Key from the GroqCloud website. My API key is already saved in the Google Colab secrets. We start by initializing the Groq client:

import os
from groq import Groq

gclient = Groq(

Next we need to define a couple of helper functions that will enable us to interact with the chat interface effectively (these are adapted from the YouTube video):

import time
from tqdm import tqdm

def process_data(prompt):

"""Send one request and retrieve model's generation."""

chat_completion =
messages=prompt, # input prompt to send to the model
model="llama3-70b-8192", # according to GroqCloud labeling
temperature=0.5, # controls diversity
max_tokens=128, # max number tokens to generate
top_p=1, # proportion of likelihood weighted options to consider
stop=None, # string that signals to stop generating
stream=False, # if set partial messages are sent
return chat_completion.choices[0].message.content

def send_messages(messages):

"""Process messages in batches with a pause between batches."""

batch_size = 10
answers = []

for i in tqdm(range(0, len(messages), batch_size)): # batches of size 10

batch = messages[i:i+10] # get the next batch of messages

for message in batch:
output = process_data(message)

if i + 10 < len(messages): # check if there are batches left
time.sleep(10) # wait for 10 seconds

return answers

The first function process_data() serves as a wrapper for the chat completion function of the Groq client. The second function send_messages(), processes the data in small batches. If you follow the Settings link on the Groq playground page, you will find a link to Limits which details the conditions under which we can use the free API, including caps on the number of requests and generated tokens. To avoid exceedind these limits, I added a 10-seconds delay after each batch of 10 messages, although it wasn’t strictly necessary in my case. You might want to experiment with these settings.

What remains now is to generate our relation extraction data and integrate it with the initial dataset :

# Data generation with Llama3-70B
answers = send_messages(messages)

# Combine input data with the generated dataset
combined_dataset = [{'text': user, 'gold_re': output} for user, output in zip(sampler, answers)]

Before proceeding with fine-tuning the model, it’s important to evaluate its performance on several samples to determine if fine-tuning is indeed necessary.

Building a Testing Dataset

We will select 20 samples from the dataset we just constructed and set them aside for testing. The remainder of the dataset will be used for fine-tuning.

import random

# Select 20 random entries
mini_data = random.sample(combined_dataset, 20)

# Build conversational format
parsed_mini_data = [[{'role': 'system', 'content': system_message},
{'role': 'user', 'content': e['text']}] for e in mini_data]

# Create the training set
train_data = [item for item in combined_dataset if item not in mini_data]

We will use the GroqCloud API and the utilities defined above, specifying model=llama3-8b-8192 while the rest of the function remains unchanged. In this case, we can directly process our small dataset without concern of exceeded the API limits.

Here is a sample output that provides the original text, the Llama3-70B generation denoted gold_re and the Llama3-8B hgeneration labelled test_re.

{'text': 'Long before any knowledge of electricity existed, people were aware of shocks from electric fish.',
'gold_re': 'people|were aware of|shocks\nshocks|from|electric fish\nelectric fish|had|electricity',
'test_re': 'electric fish|were aware of|shocks'}

For the full test dataset, please refer to the Google Colab notebook.

Just from this example, it becomes clear that Llama3–8B could benefit from some improvements in its relation extraction capabilities. Let’s work on enhancing that.

We will utilize a full arsenal of techniques to assist us, including QLoRA and Flash Attention. I won’t delve into the specifics of choosing hyperparameters here, but if you’re interested in exploring further, check out these great references [4] and [5].

The A100 GPU supports Flash Attention and bfloat16, and it possesses about 40GB of memory, which is sufficient for our fine-tuning needs.

Preparing the SFT Dataset

We start by parsing the dataset into a conversational format, including a system message, input text and the desired answer, which we derive from the Llama3–70B generation. We then save it as a HuggingFace dataset:

def create_conversation(sample):
return {
"messages": [
{"role": "system","content": system_message},
{"role": "user", "content": sample["text"]},
{"role": "assistant", "content": sample["gold_re"]}

from datasets import load_dataset, Dataset

train_dataset = Dataset.from_list(train_data)

# Transform to conversational format
train_dataset =,

Choose the Model

model_id  =  "meta-llama/Meta-Llama-3-8B"

Load the Tokenizer

from transformers import AutoTokenizer

# Tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_id,

tokenizer.pad_token = tokenizer.eos_token
tokenizer.pad_token_id = tokenizer.eos_token_id
tokenizer.padding_side = 'left'

# Set a maximum length
tokenizer.model_max_length = 512

Choose Quantization Parameters

from transformers import BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(

Load the Model

from transformers import AutoModelForCausalLM
from peft import prepare_model_for_kbit_training
from trl import setup_chat_format

device_map = {"": torch.cuda.current_device()} if torch.cuda.is_available() else None

model = AutoModelForCausalLM.from_pretrained(

model, tokenizer = setup_chat_format(model, tokenizer)
model = prepare_model_for_kbit_training(model)

LoRA Configuration

from peft import LoraConfig

# According to Sebastian Raschka findings
peft_config = LoraConfig(
lora_alpha=128, #32
r=256, #16
target_modules=["q_proj", "o_proj", "gate_proj", "up_proj",
"down_proj", "k_proj", "v_proj"],

The best results are achieved when targeting all the linear layers. If memory constraints are a concern, opting for more standard values such as alpha=32 and rank=16 can be beneficial, as these settings result in significantly fewer parameters.

Training Arguments

from transformers import TrainingArguments

# Adapted from Phil Schmid blogpost
args = TrainingArguments(
output_dir=sft_model_path, # directory to save the model and repository id
num_train_epochs=2, # number of training epochs
per_device_train_batch_size=4, # batch size per device during training
gradient_accumulation_steps=2, # number of steps before performing a backward/update pass
gradient_checkpointing=True, # use gradient checkpointing to save memory, use in distributed training
optim="adamw_8bit", # choose paged_adamw_8bit if not enough memory
logging_steps=10, # log every 10 steps
save_strategy="epoch", # save checkpoint every epoch
learning_rate=2e-4, # learning rate, based on QLoRA paper
bf16=True, # use bfloat16 precision
tf32=True, # use tf32 precision
max_grad_norm=0.3, # max gradient norm based on QLoRA paper
warmup_ratio=0.03, # warmup ratio based on QLoRA paper
lr_scheduler_type="constant", # use constant learning rate scheduler
push_to_hub=True, # push model to Hugging Face hub
report_to="tensorboard", # report metrics to tensorboard

If you choose to save the model locally, you can omit the last three parameters. You may also need to adjust the per_device_batch_size and gradient_accumulation_steps to prevent Out of Memory (OOM) errors.

Initialize the Trainer and Train the Model

from trl import SFTTrainer

trainer = SFTTrainer(
packing=False, # True if the dataset is large
"add_special_tokens": False, # the template adds the special tokens
"append_concat_token": False, # no need to add additional separator token


The training, including model saving, took about 10 minutes.

Let’s clear the memory to prepare for inference tests. If you’re using a GPU with less memory and encounter CUDA Out of Memory (OOM) errors, you might need to restart the runtime.

import torch
import gc
del model
del tokenizer

In this final step we will load the base model in half precision along with the Peft adapter. For this test, I have chosen not to merge the model with the adapter.

from peft import AutoPeftModelForCausalLM
from transformers import AutoTokenizer, pipeline
import torch

# HF model
peft_model_id = "solanaO/llama3-8b-sft-qlora-re"

# Load Model with PEFT adapter
model = AutoPeftModelForCausalLM.from_pretrained(

Next, we load the tokenizer:

okenizer = AutoTokenizer.from_pretrained(peft_model_id)

tokenizer.pad_token = tokenizer.eos_token
tokenizer.pad_token_id = tokenizer.eos_token_id

And we build the text generation pipeline:

pipe = pipeline("text-generation", model=model, tokenizer=tokenizer)

We load the test dataset, which consists of the 20 samples we set aside previously, and format the data in a conversational style. However, this time we omit the assistant message and format it as a Hugging Face dataset:

def create_input_prompt(sample):
return {
"messages": [
{"role": "system","content": system_message},
{"role": "user", "content": sample["text"]},

from datasets import Dataset

test_dataset = Dataset.from_list(mini_data)

# Transform to conversational format
test_dataset =,

One Sample Test

Let’s generate relation extraction output using SFT Llama3–8B and compare it to the previous two outputs on a single instance:

 Generate the input prompt
prompt = pipe.tokenizer.apply_chat_template(test_dataset[2]["messages"][:2],
# Generate the output
outputs = pipe(prompt,
# Display the results
print(f"Question: {test_dataset[2]['messages'][1]['content']}\n")
print(f"Gold-RE: {test_sampler[2]['gold_re']}\n")
print(f"LLama3-8B-RE: {test_sampler[2]['test_re']}\n")
print(f"SFT-Llama3-8B-RE: {outputs[0]['generated_text'][len(prompt):].strip()}")

We obtain the following:

Question: Long before any knowledge of electricity existed, people were aware of shocks from electric fish.

Gold-RE: people|were aware of|shocks
shocks|from|electric fish
electric fish|had|electricity

LLama3-8B-RE: electric fish|were aware of|shocks

SFT-Llama3-8B-RE: people|were aware of|shocks
shocks|from|electric fish

In this example, we observe significant improvements in the relation extraction capabilities of Llama3–8B through fine-tuning. Despite the fine-tuning dataset being neither very clean nor particularly large, the results are impressive.

For the complete results on the 20-sample dataset, please refer to the Google Colab notebook. Note that the inference test takes longer because we load the model in half-precision.

In conclusion, by utilizing Llama3–70B and an available dataset, we successfully created a synthetic dataset which was then used to fine-tune Llama3–8B for a specific task. This process not only familiarized us with Llama3, but also allowed us to apply straightforward techniques from Hugging Face. We observed that working with Llama3 closely resembles the experience with Llama2, with the notable improvements being enhanced output quality and a more effective tokenizer.

For those interested in pushing the boundaries further, consider challenging the model with more complex tasks such as categorizing entities and relationships, and using these classifications to build a knowledge graph.

  1. Somin Wadhwa, Silvio Amir, Byron C. Wallace, Revisiting Relation Extraction in the era of Large Language Models, arXiv.2305.05003 (2023).
  2. Meta, Introducing Meta Llama 3: The most capable openly available LLM to date, April 18, 2024 (link).
  3. Philipp Schmid, Omar Sanseviero, Pedro Cuenca, Youndes Belkada, Leandro von Werra, Welcome Llama 3 — Met’s new open LLM, April 18, 2024.
  4. Sebastian Raschka, Practical Tips for Finetuning LLMs Using LoRA (Low-Rank Adaptation), Ahead of AI, Nov 19, 2023.
  5. Philipp Schmid, How to Fine-Tune LLMs in 2024 with Hugging Face, Jan 22, 2024.

databricks-dolly-15K on Hugging Face platform (CC BY-SA 3.0)

Github Repo

Recent Articles

Related Stories

Leave A Reply

Please enter your comment!
Please enter your name here