PEFT fine tuning of Llama 3 on SageMaker HyperPod with AWS Trainium

Training large language models (LLMs) has become a significant expense for businesses. For many use cases, companies are looking to use LLM foundation models (FMs) with their domain-specific data. However, companies are discovering that performing full fine tuning for these models with their data isn’t cost effective. To reduce costs while continuing to use the power of AI, many companies have shifted to fine tuning LLMs on their domain-specific data using Parameter-Efficient Fine Tuning (PEFT). PEFT is a set of techniques designed to adapt pre-trained LLMs to specific tasks while minimizing the number of parameters that need to be updated. Techniques such as Low-Rank Adaptation (LoRA) and Weight-Decomposed Low-Rank Adaptation (DoRA) significantly reduce the number of trainable parameters, resulting in lower costs for fine tuning.

In addition to cost, performing fine tuning for LLMs at scale presents significant technical challenges. The process of setting up and configuring a distributed training environment can be complex, requiring expertise in server management, cluster configuration, networking, and distributed computing. Manually managing such complexity can often be counterproductive and take valuable resources away from your business’s AI development. To simplify infrastructure setup and accelerate distributed training, AWS introduced Amazon SageMaker HyperPod in late 2023.

In this blog post, we showcase how you can perform efficient supervised fine tuning for a Meta Llama 3 model using PEFT on AWS Trainium with SageMaker HyperPod. We use Hugging Face’s Optimum-Neuron software development kit (SDK) to apply LoRA to fine-tuning jobs, and use SageMaker HyperPod as the primary compute cluster to perform distributed training on Trainium. Using LoRA supervised fine tuning for Meta Llama 3 models, you can reduce your cost to fine tune models by up to 50% and increase training throughput by about 70% compared to full parameter fine tuning, as shown in the Results section.

Solution overview

SageMaker HyperPod is designed to help reduce the time required to train generative AI FMs by providing a purpose-built infrastructure for distributed training at scale. When using SageMaker HyperPod for training, SageMaker will actively monitor the cluster’s health, automatically replacing faulty nodes and resuming model training from checkpoints. The clusters come pre-configured with SageMaker distributed training libraries that enable you to split your training data and model across thousands of compute nodes, allowing data to be processed in parallel while fully utilizing the cluster’s compute and network infrastructure. You can also customize your distributed training. The architecture diagram that follows provides a high level overview of these various components:

Compute cluster: This contains a head node that orchestrates computation across a cluster of worker nodes. Because the head node is only facilitating the training, it’s typically a much smaller instance. In this post, we use Amazon Elastic Compute Cloud (Amazon EC2) Trn1 instances for the worker nodes and a single Amazon EC2 C5 instance for the head node.
Shared Volume: FSx for Lustre is used as the shared storage volume across nodes to maximize data throughput. It’s mounted at /fsx on the head and compute nodes.
External storage: Amazon Simple Storage Service (Amazon S3) is used to store the cluster’s lifecycle scripts, configuration files, datasets, and checkpoints.
Scheduler: Slurm is used as the job scheduler for the cluster.

Trainium chips are purpose-built for deep learning training of 100 billion and larger parameter models. Model training on Trainium is supported by the AWS Neuron SDK, which provides compiler, runtime, and profiling tools that unlock high-performance and cost-effective deep learning acceleration. To learn more about Trainium chips and the Neuron SDK, see Welcome to AWS Neuron.

To integrate Trainium chips with existing models and tools provided through the transformers package, Hugging Face’s Optimum-Neuron package functions as an interface with Neuron. With Optimum-Neuron, users can apply techniques such as LoRA to their fine-tuning jobs, streamlining the process of adapting LLMs for specific tasks while capitalizing on the performance gains provided by the AWS infrastructure.

Traditional fine tuning involves modifying all the parameters of a model, which can be computationally expensive and memory intensive. PEFT approaches such as LoRA focus on introducing a smaller set of trainable parameters, often in the form of low-rank matrices that adjust the model’s behavior while keeping most of its parameters frozen. The advantage of LoRA lies in its ability to maintain the performance of the base model while significantly lowering the computational burden and resource requirements. The Neuron 2.20 release supports model training with LoRA on Trainium.
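To make the low-rank idea concrete, the following is a minimal sketch of a LoRA update on a single linear layer, written in plain Python so it runs without any framework. The matrix shapes and values are made up for illustration, and the alpha / r scaling is the convention commonly used by LoRA implementations; it is not code from the Neuron SDK or Optimum-Neuron.

```python
# Toy illustration of a LoRA update on one linear layer (pure Python, no framework).
# All shapes and values are made up for the example.

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def add(a, b):
    """Element-wise sum of two matrices of the same shape."""
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]

def scale(a, s):
    """Multiply every entry of a matrix by the scalar s."""
    return [[x * s for x in row] for row in a]

d_out, d_in, r, alpha = 4, 3, 1, 2

W = [[1.0, 0.0, 2.0], [0.0, 1.0, 0.0], [3.0, 0.0, 1.0], [0.0, 2.0, 0.0]]  # frozen, d_out x d_in
B = [[0.5], [0.0], [1.0], [-1.0]]   # trainable, d_out x r
A = [[1.0, 2.0, 0.0]]               # trainable, r x d_in
x = [[1.0], [1.0], [1.0]]           # input column vector

# Adapter path: frozen base output plus the scaled low-rank correction.
y_adapter = add(matmul(W, x), scale(matmul(B, matmul(A, x)), alpha / r))

# Merged path: fold the correction into the weight once, then do a single matmul.
W_merged = add(W, scale(matmul(B, A), alpha / r))
y_merged = matmul(W_merged, x)

full_params = d_out * d_in          # parameters updated by full fine tuning
lora_params = r * (d_out + d_in)    # parameters updated by LoRA
print(y_adapter == y_merged, full_params, lora_params)
```

Both paths produce the same output, which is why the adapter can be merged into the base weights after training. The savings look small on a 4 x 3 matrix, but the gap between d_out * d_in and r * (d_out + d_in) grows quickly as the layer dimensions increase.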

In the next section, we’ll walk through the code in three steps for PEFT on Trainium with HyperPod:

Setting up and deploying a HyperPod cluster for distributed training.
Fine tuning a Meta Llama 3-8B model on a Trainium instance with the dolly 15k dataset.
Model weights consolidation and inference.

Amazon SageMaker HyperPod cluster setup

In this first section, you will begin setting up your Amazon SageMaker HyperPod compute environment for fine tuning.

Prerequisites

The following are the prerequisites for configuring and deploying a SageMaker HyperPod cluster for fine tuning:

Step 1: Infrastructure setup

After completing the prerequisites, deploy an AWS CloudFormation stack that contains the necessary infrastructure components for distributed training through SageMaker HyperPod. The default Region specified in the template is us-west-2, but you can modify that. You will also need to specify the Availability Zone where your subnets will be deployed. The template configures your environment with an Amazon Virtual Private Cloud (Amazon VPC) and corresponding public and private subnets for network isolation. It establishes additional components inside your VPC including an S3 bucket for lifecycle scripts and FSx for Lustre, a file system shared across the head and compute nodes of the HyperPod cluster.

Step 2: Cluster configuration

Configure and deploy the HyperPod cluster. Begin by defining your infrastructure’s environment variables through the create_config script. This script uses the AWS CLI to extract infrastructure component variables from your CloudFormation stack including Region, resource IDs, and Amazon Resource Name (ARN).

# Set region
export AWS_REGION=us-west-2

# Fetch create_config script
curl 'https://static.us-east-1.prod.workshops.aws/public/05a78a77-24f9-4f29-867c-64c9687646e1/static/scripts/create_config.sh' --output create_config.sh

# Set environment variables
bash create_config.sh
source env_vars
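To give a rough idea of the kind of extraction described above, the following Python sketch turns the Outputs section of a CloudFormation stack description into export lines. This is an illustration only and is not the content of create_config.sh; the output keys and values in the example are hypothetical, and the real ones depend on the template you deployed.

```python
import shlex

def outputs_to_env_lines(stack):
    """Turn a CloudFormation stack description into `export KEY=value` lines."""
    lines = []
    for output in stack.get("Outputs", []):
        key = output["OutputKey"].upper()
        # shlex.quote protects values that contain spaces or shell metacharacters.
        lines.append(f"export {key}={shlex.quote(output['OutputValue'])}")
    return lines

# Hypothetical stack description; real keys depend on the template you deployed.
example_stack = {
    "StackName": "sagemaker-hyperpod",
    "Outputs": [
        {"OutputKey": "Bucket", "OutputValue": "example-lifecycle-bucket"},
        {"OutputKey": "SubnetId", "OutputValue": "subnet-0123456789abcdef0"},
    ],
}
env_lines = outputs_to_env_lines(example_stack)
print("\n".join(env_lines))
```

Writing lines like these to a file and sourcing it is one way to make values such as BUCKET available to the later commands in this post.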

After setting your environment variables, download the lifecycle scripts required for bootstrapping the compute nodes on your SageMaker HyperPod cluster and define its configuration settings before uploading the scripts to your S3 bucket.

# Download Lifecycle scripts
git clone --depth=1 https://github.com/aws-samples/awsome-distributed-training/

# upload scripts to s3
aws s3 cp --recursive awsome-distributed-training/1.architectures/5.sagemaker-hyperpod/LifecycleScripts/base-config/ s3://${BUCKET}/src

After uploading the Lifecycle scripts to Amazon S3, create your cluster and file system configurations. See the Create Cluster section of the SageMaker HyperPod workshop to create these files. After generating the cluster-config.json and provisioning_parameters.json configuration files, validate them and upload the FSx for Lustre configuration file to Amazon S3.

# validate and check config for known issues
curl -O https://raw.githubusercontent.com/aws-samples/awsome-distributed-training/main/1.architectures/5.sagemaker-hyperpod/validate-config.py
python3 validate-config.py --cluster-config cluster-config.json --provisioning-parameters provisioning_parameters.json

# Upload FSx configuration to S3
aws s3 cp provisioning_parameters.json s3://${BUCKET}/src/

Step 3: Cluster deployment

Now that the cluster’s configuration is defined, you can create the cluster.

aws sagemaker create-cluster \
--cli-input-json file://cluster-config.json \
--region $AWS_REGION

You should be able to see your cluster, named ml-cluster, by navigating to SageMaker HyperPod in the AWS Management Console. After a few minutes, its status should change from Creating to InService.

SageMaker Console

If you select your cluster, you will be able to see the details of your compute cluster including the head and worker nodes.

SageMaker Console
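If you prefer to wait for the status change from a script instead of the console, a polling loop along the following lines might help. It is a sketch under assumptions: the function name, polling interval, and the fake client are ours, and the fake client exists only so the example runs offline. With real credentials you would pass boto3.client("sagemaker") instead, whose describe_cluster call returns a ClusterStatus field.

```python
import time

def wait_for_cluster(client, cluster_name, poll_seconds=30, max_polls=60, sleep=time.sleep):
    """Poll DescribeCluster until the cluster leaves the Creating state."""
    for _ in range(max_polls):
        status = client.describe_cluster(ClusterName=cluster_name)["ClusterStatus"]
        if status != "Creating":
            return status
        sleep(poll_seconds)
    raise TimeoutError(f"{cluster_name} still Creating after {max_polls} polls")

class FakeClient:
    """Stand-in for boto3.client('sagemaker') so the sketch runs offline."""
    def __init__(self, statuses):
        self._statuses = iter(statuses)

    def describe_cluster(self, ClusterName):
        return {"ClusterName": ClusterName, "ClusterStatus": next(self._statuses)}

final_status = wait_for_cluster(
    FakeClient(["Creating", "Creating", "InService"]),
    "ml-cluster",
    sleep=lambda seconds: None,  # skip real waiting in the demonstration
)
print(final_status)
```

The function returns the first status other than Creating, so a Failed cluster is reported to the caller instead of being polled until the timeout.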

After installing the Systems Manager Session Manager plugin, you can ssh into your cluster’s head node using the easy-ssh script to begin training.

# Modify permissions and ssh
chmod +x easy-ssh.sh
./easy-ssh.sh -c controller-machine ml-cluster

# Switch to ubuntu user
sudo su - ubuntu

# Change directory
cd /fsx

Now that your cluster is running and accessible through ssh, you can begin uploading the model training scripts to the shared file system through either curl or the AWS CLI. For more instructions on setting up your cluster, see the SageMaker HyperPod workshop.

Fine tuning

Now that your SageMaker HyperPod cluster is deployed, you can start preparing to execute your fine tuning job.

Data preparation

The foundation of successful language model fine tuning lies in properly structured and prepared training data. This implementation focuses on instruction-tuned datasets, which form the backbone of modern language model adaptation. Each training example in these datasets combines three essential components:

Instructions that guide the model’s task.
Optional context that provides background information.
Responses that represent the desired output.

Training begins by loading your dataset and formatting its examples with this structure. You can load the dataset through the Hugging Face datasets library, which provides a straightforward interface for accessing and managing training data. Hugging Face also provides this format function for the databricks-dolly-15k dataset. Note that the format function needs to be embedded in your train.py file (as shown in the following sample). It’s referenced by the NeuronSFTTrainer to format your dataset during fine tuning.

# Load dataset
dataset = load_dataset(args.dataset, split="train")

def format_dolly(examples):
    output_text = []
    for i in range(len(examples["instruction"])):
        instruction = f"### Instruction\n{examples['instruction'][i]}"
        context = f"### Context\n{examples['context'][i]}" if examples["context"][i] else None
        response = f"### Answer\n{examples['response'][i]}"
        # Join only the sections that are present (context is skipped when empty)
        prompt = "\n\n".join([part for part in [instruction, context, response] if part is not None])
        output_text.append(prompt)
    return output_text

The formatting function employs delimiter tokens ("###") to create clear boundaries between different components of each training example. This separation is important because it helps the model distinguish between different parts of the input during training. The function handles cases where context might be missing, making sure that the final format remains consistent regardless of whether all components are present. Double newlines between sections provide additional structural clarity that helps the model recognize the natural breaks in the input.
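To see what the formatted text looks like, the following sketch applies the same formatting logic to a tiny hand-written batch. The two records are made up for illustration and are not taken from the databricks-dolly-15k dataset; the function body restates the format function above so that the example runs on its own.

```python
def format_dolly(examples):
    """Format a batch of instruction records into delimited training prompts."""
    output_text = []
    for i in range(len(examples["instruction"])):
        instruction = f"### Instruction\n{examples['instruction'][i]}"
        context = f"### Context\n{examples['context'][i]}" if examples["context"][i] else None
        response = f"### Answer\n{examples['response'][i]}"
        parts = [instruction, context, response]
        output_text.append("\n\n".join(part for part in parts if part is not None))
    return output_text

# Made-up records: the first has no context, the second has one.
batch = {
    "instruction": ["Name the capital of France.", "Summarize the text."],
    "context": ["", "Trainium is an AWS accelerator for training."],
    "response": ["Paris.", "Trainium is AWS training hardware."],
}
formatted = format_dolly(batch)
print(formatted[0])
print("---")
print(formatted[1])
```

The first prompt contains only the Instruction and Answer sections, while the second also contains a Context section between them.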

Tokenization

After formatting your dataset, the next step is tokenization—the process of converting your text data into a numerical format that your model can understand. Tokenization serves as the bridge between your human-readable text and the mathematical operations that drive your model’s understanding of language. To begin, you use Hugging Face’s AutoTokenizer to load your model’s tokenizer.

tokenizer = AutoTokenizer.from_pretrained(args.tokenizer_path)
tokenizer.pad_token = tokenizer.eos_token

The AutoTokenizer class automatically selects the appropriate tokenizer for your model, loading not just the vocabulary, but also the rules and special tokens that match your training configuration. The assignment of the padding token to match the end-of-sequence token is particularly important for causal language modeling, because it makes sure that your variable-length sequences are handled consistently when they are padded into batches.

The tokenization process itself operates in several stages. First, it breaks down your input text into tokens based on its vocabulary. These tokens are then converted to numerical IDs that your model can process. During this process, your tokenizer also handles special tokens that mark the beginning and end of sequences, in addition to padding tokens that make sure that the sequences in your batch have the same length.

When working with tokenizers, your sequence length management becomes a critical consideration. Your maximum sequence length must balance between preserving enough information for your model to understand the context and staying within your model’s architectural limitations. Too short, and you risk losing important context; too long, and you might exceed memory constraints or introduce unnecessary computational overhead.
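The trade-off can be illustrated with a toy helper that truncates or pads a list of token IDs to a fixed length and builds the matching attention mask. This is a simplified stand-in for what a tokenizer does when padding and truncation are enabled, not the Hugging Face implementation; the function name and the token IDs are hypothetical.

```python
def pad_or_truncate(token_ids, max_length, pad_id):
    """Return (input_ids, attention_mask), both exactly max_length long."""
    kept = token_ids[:max_length]                     # truncate on the right
    padding = max_length - len(kept)
    input_ids = kept + [pad_id] * padding             # pad on the right
    attention_mask = [1] * len(kept) + [0] * padding  # 0 marks padding positions
    return input_ids, attention_mask

PAD_ID = 0  # hypothetical ID; in this post the pad token is set to the EOS token
short_ids, short_mask = pad_or_truncate([11, 12, 13], max_length=5, pad_id=PAD_ID)
long_ids, long_mask = pad_or_truncate([11, 12, 13, 14, 15, 16, 17], max_length=5, pad_id=PAD_ID)
print(short_ids, short_mask)
print(long_ids, long_mask)
```

The short sequence wastes two positions on padding, and the long sequence loses its last two tokens, which is the balance that the maximum sequence length has to strike.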

Model compilation and fine tuning

For this solution, you created a SageMaker HyperPod cluster with the controller node and one worker node. The worker node contains one ml.trn1.32xlarge instance which has 32 Neuron cores. You can conduct distributed fine tuning using all 32 Neuron cores within the worker node.

Step 1: Environment setup

You first need to install the required Python packages for fine tuning. The following is the bash script for the Python environment setup. Note that the solution uses the most recently released Neuron SDK. From the HOME directory, create a file with touch environment.sh, add the following code, and run it with sbatch ./environment.sh. You might need to modify the permissions of the shell scripts throughout this post before running them, for example with chmod +x environment.sh.

#!/usr/bin/env bash
#SBATCH --nodes=1
#SBATCH --exclusive
#SBATCH -o /fsx/ubuntu/peft_ft/logs/8b/environment.out

sudo apt install -y python3.8-venv git
python3.8 -m venv $HOME/peft_ft/env_llama3_8B_peft
source $HOME/peft_ft/env_llama3_8B_peft/bin/activate
pip install -U pip

python3 -m pip config set global.extra-index-url "https://pip.repos.neuron.amazonaws.com"
python3 -m pip install torch-neuronx==2.1.2.2.3.0 neuronx-cc==2.15.128.0 neuronx_distributed==0.9.0 torchvision
python3 -m pip install datasets transformers peft huggingface_hub trl PyYAML
python3 -m pip install git+https://github.com/huggingface/optimum-neuron.git

With your environment created, switch to your fine-tuning directory before proceeding to the next step: cd $HOME/peft_ft.

Step 1: Download the base Llama 3 8B model and tokenizer from Hugging Face

Download the base Meta Llama 3 8B model and the corresponding tokenizer from Hugging Face. You will need to first request access for the model from Meta on Hugging Face and then use your Hugging Face access token to download the model. The following is the Python code for the get_model.py script to download the model and tokenizer. Create this file with touch get_model.py and copy the following code to this file before moving on to the next step.

import os
import argparse
from transformers import AutoTokenizer, LlamaForCausalLM

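def download_model_and_tokenizer(model_id: str, model_output_path: str, tokenizer_output_path: str, huggingface_token: str = None) -> None:
    # Fall back to the HUGGINGFACE_TOKEN environment variable when no token is passed in
    if huggingface_token is None:
        huggingface_token = os.environ.get("HUGGINGFACE_TOKEN")
    # Download the model weights and save them to the shared file system
    model = LlamaForCausalLM.from_pretrained(model_id, token=huggingface_token)
    model.save_pretrained(model_output_path)
    # Download the matching tokenizer and save it alongside the model
    tokenizer = AutoTokenizer.from_pretrained(model_id, token=huggingface_token)
    tokenizer.save_pretrained(tokenizer_output_path)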
    
if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("--model_id", type=str, required=True, help="Hugging Face Model id")
    parser.add_argument("--model_output_path", type=str, required=True, help="Path to save model/weights file")
    parser.add_argument("--tokenizer_output_path", type=str, required=True, help="Path to save tokenizer file")
    args, _ = parser.parse_known_args()
    download_model_and_tokenizer(model_id=args.model_id, model_output_path=args.model_output_path, tokenizer_output_path=args.tokenizer_output_path)

Next, create the bash script with touch get_model.sh, add the code that follows, and run it with the command sbatch ./get_model.sh. This will trigger the get_model.py script to download the model and tokenizer using Slurm. Because you’re using the Llama 3 8B model, Hugging Face requires you to authenticate with an access token prior to download. Be sure to add your access token to get_model.sh before running the script.

#!/bin/bash
#SBATCH --nodes=1
#SBATCH --exclusive
#SBATCH -o /fsx/ubuntu/peft_ft/logs/8b/get_model.out

export OMP_NUM_THREADS=1
export HUGGINGFACE_TOKEN="<YOUR TOKEN HERE>"
source $HOME/peft_ft/env_llama3_8B_peft/bin/activate

srun python3 $HOME/peft_ft/get_model.py \
--model_id meta-llama/Meta-Llama-3-8B-Instruct \
--model_output_path $HOME/peft_ft/model_artifacts/llama3-8B \
--tokenizer_output_path $HOME/peft_ft/tokenizer/llama3-8B

Step 2: Pre-compile model

Training deep learning models on Trainium requires model compilation. To do that, use the neuron_parallel_compile CLI utility, which extracts graphs from a trial run of your script and performs parallel pre-compilation of the computation graphs. Note that the scripts for model pre-compilation are identical to those for the actual training, except for max_steps. This is because pre-compilation doesn’t require the completion of the entire training cycle; it needs only approximately 10 training steps to extract the graphs. Before compiling the model, create the training script with touch train.py. This script is used for both the pre-compilation and model fine tuning steps. Add the following code after creating the file, along with the format function previously mentioned.

import os
import torch
import argparse
from datasets import load_dataset
from peft import LoraConfig
from transformers import AutoModelForCausalLM, AutoTokenizer, set_seed
from optimum.neuron import NeuronSFTConfig, NeuronSFTTrainer
from optimum.neuron.distributed import lazy_load_for_parallelism
import torch_xla.core.xla_model as xm

# add format_dolly function here

def training_function(args):
    dataset = load_dataset(args.dataset, split="train")    
    tokenizer = AutoTokenizer.from_pretrained(args.tokenizer_path)
    tokenizer.pad_token = tokenizer.eos_token
    with lazy_load_for_parallelism(tensor_parallel_size=args.tp_size):
        model = AutoModelForCausalLM.from_pretrained(
            args.model_path, 
            low_cpu_mem_usage=True, 
            torch_dtype=torch.bfloat16 if args.bf16 else torch.float32
        )

    lora_config = LoraConfig(
        r=16,
        lora_alpha=16,
        lora_dropout=0.05,
        target_modules=["q_proj", "v_proj"],
        bias="none",
        task_type="CAUSAL_LM",
    )
        
    training_args = NeuronSFTConfig(
        output_dir=args.model_checkpoint_path,
        overwrite_output_dir=True,
        num_train_epochs=args.epochs,
        per_device_train_batch_size=args.train_batch_size,
        gradient_accumulation_steps=args.gradient_accumulation_steps,
        learning_rate=args.learning_rate,
        weight_decay=args.weight_decay,
        warmup_steps=args.warmup_steps,
        bf16=args.bf16,
        tensor_parallel_size=args.tp_size,
        pipeline_parallel_size=args.pp_size,
        save_steps=args.checkpoint_frequency,
        logging_steps=100,
        max_steps=args.max_steps,
        )

    trainer = NeuronSFTTrainer(
        args=training_args,
        model=model,
        peft_config=lora_config,
        tokenizer=tokenizer,
        train_dataset=dataset,
        formatting_func=format_dolly,
    )

    trainer.train()
    trainer.save_model(args.model_final_path)
    
if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("--model_path", type=str)
    parser.add_argument("--tokenizer_path", type=str)
    parser.add_argument("--epochs", type=int)
    parser.add_argument("--train_batch_size", type=int)
    parser.add_argument("--learning_rate", type=float)
    parser.add_argument("--weight_decay", type=float)
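    # argparse's type=bool treats any non-empty string (including "False") as True, so parse the text explicitly
    parser.add_argument("--bf16", type=lambda value: str(value).lower() in ("true", "1", "yes"))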
    parser.add_argument("--tp_size", type=int)
    parser.add_argument("--pp_size", type=int)
    parser.add_argument("--gradient_accumulation_steps", type=int)
    parser.add_argument("--warmup_steps", type=int)
    parser.add_argument("--early_stopping_patience", type=int)
    parser.add_argument("--checkpoint_frequency", type=int)
    parser.add_argument("--dataset", type=str)
    parser.add_argument("--max_steps", type=int)
    parser.add_argument("--max_seq_length", type=int)
    parser.add_argument("--model_checkpoint_path", type=str)
    parser.add_argument("--model_final_path", type=str)
    args = parser.parse_args()
    training_function(args)

After creating the training file, use the following code to create the compile.sh script, which will trigger finetune-llama3-8B.sh to compile the Llama 3 8B model using the neuron_parallel_compile command. You can run this with the sbatch compile.sh command.

#!/bin/bash
#SBATCH --nodes=1
#SBATCH --exclusive
#SBATCH -o /fsx/ubuntu/peft_ft/logs/8b/compile.out

source $HOME/peft_ft/env_llama3_8B_peft/bin/activate

# A value greater than 0 makes finetune-llama3-8B.sh run under neuron_parallel_compile
export NEURON_EXTRACT_GRAPHS_ONLY=1
srun bash ${HOME}/peft_ft/finetune-llama3-8B.sh

The following is the finetune-llama3-8B.sh script, which lists the hyperparameters for your model fine tuning. The script uses tensor parallelism with a degree of 8. With 32 NeuronCores in the ml.trn1.32xlarge instance, this gives a data parallel degree of 4 (32 / 8). Note that the script also sets XLA_USE_BF16=1 to map both torch.float and torch.double tensors to bfloat16 tensors. This can both reduce memory footprint and improve performance. The script then sets gradient_accumulation_steps to 3 to get a larger effective batch size for each gradient update.

#!/bin/bash
GPUS_PER_NODE=32
if [ $NEURON_EXTRACT_GRAPHS_ONLY -gt 0 ]; then
    MAX_STEPS=10
    MAYBE_COMPILE="neuron_parallel_compile"
else
    MAX_STEPS=-1
fi

declare -a TORCHRUN_ARGS=(
    --nproc_per_node=$GPUS_PER_NODE
    --nnodes=$SLURM_JOB_NUM_NODES
)
export TRAIN_SCRIPT=${HOME}/peft_ft/train.py

declare -a TRAINING_ARGS=(
    --bf16 True \
    --checkpoint_frequency 400 \
    --dataset "databricks/databricks-dolly-15k" \
    --max_steps $MAX_STEPS \
    --max_seq_length 1024 \
    --epochs 1 \
    --gradient_accumulation_steps 3 \
    --learning_rate 2e-05 \
    --model_path "/fsx/ubuntu/peft_ft/model_artifacts/llama3-8B" \
    --tokenizer_path "/fsx/ubuntu/peft_ft/tokenizer/llama3-8B" \
    --model_checkpoint_path "/fsx/ubuntu/peft_ft/model_checkpoints" \
    --model_final_path "/fsx/ubuntu/peft_ft/model_checkpoints/final" \
    --tp_size 8 \
    --pp_size 1 \
    --train_batch_size 1 \
    --warmup_steps 100 \
    --weight_decay 0.01 
)
$MAYBE_COMPILE torchrun "${TORCHRUN_ARGS[@]}" $TRAIN_SCRIPT "${TRAINING_ARGS[@]}"
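The parallelism settings in the script determine the effective batch size and the approximate number of optimizer steps per epoch. The following sketch works through that arithmetic. The dataset row count is an assumption taken from the databricks-dolly-15k dataset card, not from this post, and the step count is an approximation of how the trainer counts steps.

```python
import math

neuron_cores = 32        # one ml.trn1.32xlarge worker node
tp_size = 8              # --tp_size
pp_size = 1              # --pp_size
per_device_batch = 1     # --train_batch_size
grad_accum = 3           # --gradient_accumulation_steps
dataset_rows = 15_011    # assumption: row count listed on the dataset card

# Cores that are not used for model parallelism form the data parallel groups.
dp_size = neuron_cores // (tp_size * pp_size)

# Samples that contribute to a single optimizer update.
effective_batch = per_device_batch * dp_size * grad_accum

# Approximate optimizer steps needed to see every sample once.
steps_per_epoch = math.ceil(dataset_rows / effective_batch)
print(dp_size, effective_batch, steps_per_epoch)
```

Under these assumptions one epoch takes about 1,251 steps, which is consistent with the checkpoint-1251 directory referenced in the consolidation scripts later in this post.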

Step 3: Model fine tuning

After model compilation is complete, you can start fine tuning by reusing the compile.sh script. To do this, prevent the neuron_parallel_compile utility from being used by setting export NEURON_EXTRACT_GRAPHS_ONLY=-1 in compile.sh, and then rerun the script to start fine tuning your model. You might need to delete the model_consolidation directory created during the previous model compilation step before you start your fine-tuning job.

Model consolidation

When working with distributed machine learning workflows, you’ll often need to manage and merge model weights efficiently. Let’s explore two essential processes that you’ll frequently encounter: checkpoint consolidation and weight merging when performing LoRA fine tuning.

Checkpoint consolidation

During distributed training, your model checkpoints are typically split across multiple devices according to the model parallelism configuration that you provide. To bring these pieces back together, you’ll use a consolidation process. Your consolidation function handles three primary tasks. First, it combines distributed checkpoints into a unified model. Then, it manages memory efficiently by processing tensors in chunks. Finally, it creates sharded outputs with an index file for quick access.
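The third task, writing sharded outputs with an index file, can be sketched with a toy greedy sharder. This is an illustration of the idea only and is not the huggingface_hub implementation used in the script that follows; the function name, tensor names, and sizes are hypothetical, and the file naming pattern simply follows the common Hugging Face convention.

```python
def shard_by_size(tensor_sizes, max_shard_bytes):
    """Greedily group tensors into shards no larger than max_shard_bytes."""
    shards, current, current_bytes = [], [], 0
    for name, size in tensor_sizes.items():
        # Start a new shard when adding this tensor would exceed the limit.
        # A tensor larger than the limit still gets a shard of its own.
        if current and current_bytes + size > max_shard_bytes:
            shards.append(current)
            current, current_bytes = [], 0
        current.append(name)
        current_bytes += size
    if current:
        shards.append(current)

    total = len(shards)
    weight_map = {}
    for idx, names in enumerate(shards, start=1):
        filename = f"model-{idx:05d}-of-{total:05d}.safetensors"
        for name in names:
            weight_map[name] = filename
    # The index records which file holds each tensor, so loaders can open only what they need.
    return {"metadata": {"total_size": sum(tensor_sizes.values())}, "weight_map": weight_map}

sizes = {"layer0.lora_A.weight": 40, "layer0.lora_B.weight": 40, "layer1.lora_A.weight": 40}
index = shard_by_size(sizes, max_shard_bytes=100)
print(index)
```

With a 100-byte limit, the first two tensors share one file and the third goes into a second file, and the weight map lets a loader find any tensor without scanning every shard.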

LoRA weight merging

When you’re working with LoRA, you need to merge the trained adapters with your base model. The merging process is straightforward but requires careful attention to detail. Start by loading your base model and LoRA configuration. Then transform the LoRA weight names to match your base model’s structure. The process concludes by merging the adapters and saving the final model in a sharded format.
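The name transformation step can be sketched in isolation. The following example keeps only the LoRA layer tensors and inserts the adapter name before the weight suffix, mirroring the renaming loop in the merge script shown later in this post. The key names are hypothetical examples, and strings stand in for tensors to keep the sketch lightweight.

```python
def rename_lora_keys(state_dict, adapter_name="default"):
    """Keep LoRA layer tensors and insert the adapter name before 'weight'."""
    renamed = {}
    for key, tensor in state_dict.items():
        if "layer" in key and "lora" in key:
            renamed[key.replace("weight", f"{adapter_name}.weight")] = tensor
        # Any other entry is dropped; the base model already provides those weights.
    return renamed

# Hypothetical key names; strings replace tensors to keep the sketch light.
consolidated = {
    "base_model.model.model.layers.0.self_attn.q_proj.lora_A.weight": "A0",
    "base_model.model.model.layers.0.self_attn.q_proj.lora_B.weight": "B0",
    "base_model.model.model.embed_tokens.weight": "E",
}
renamed = rename_lora_keys(consolidated)
for key in renamed:
    print(key)
```

After renaming, the keys carry the adapter name (default here) so that they line up with the parameter names of the wrapped PEFT model before its state is loaded.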

To put these tools into practice, you can use the following scripts after your fine-tuning job has finished. First, create the Python file with touch consolidation.py and the shell file with touch consolidation.sh, using the following code.

import argparse
import json
from pathlib import Path
from huggingface_hub import split_torch_state_dict_into_shards
from safetensors.torch import save_file
from optimum.neuron.distributed.checkpointing import consolidate_model_parallel_checkpoints
import torch

def custom_consolidate_to_unified_checkpoint(checkpoint_dir: Path, output_dir: Path, save_format: str = "safetensors"):
    output_dir.mkdir(parents=True, exist_ok=True)
    state_dict = consolidate_model_parallel_checkpoints(checkpoint_dir)
    for key, value in state_dict.items():
        if isinstance(value, torch.Tensor):
            state_dict[key] = value.contiguous()

    split_result = split_torch_state_dict_into_shards(state_dict, max_shard_size="5GB")
    # Save shards
    for shard_file, shard_tensors in split_result.filename_to_tensors.items():
        shard_dict = {name: state_dict[name] for name in shard_tensors}
        shard_path = output_dir / shard_file
        if save_format == "safetensors":
            save_file(shard_dict, shard_path, metadata={"format": "pt"})
        else:
            torch.save(shard_dict, shard_path)

    index = {
        "metadata": split_result.metadata,
        "weight_map": split_result.tensor_to_filename
    }
    
    index_file = "model.safetensors.index.json" if save_format == "safetensors" else "pytorch_model.bin.index.json"
    with open(output_dir / index_file, "w") as f:
        json.dump(index, f, indent=2)

if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("--input_dir", type=str, required=True)
    parser.add_argument("--output_dir", type=str, required=True)
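    # Default to safetensors so that omitting the flag does not pass None to the consolidation function
    parser.add_argument("--save_format", type=str, choices=["safetensors", "pytorch"], default="safetensors")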
    args = parser.parse_args()
    output_dir = Path(args.output_dir)
    checkpoint_dir = Path(args.input_dir) / "adapter_shards"
    custom_consolidate_to_unified_checkpoint(
        checkpoint_dir=checkpoint_dir,
        output_dir=output_dir,
        save_format=args.save_format
    )

This code consolidates the sharded checkpoint files generated during training into a single LoRA adapter in safetensors format. After saving the file, you can invoke this script to trigger the model checkpoint consolidation job. The input directory points to your fine-tuned model’s sharded checkpoints, and the output directory is where the consolidated LoRA adapter safetensors file is written. You trigger this with sbatch consolidation.sh.

#!/bin/bash
#SBATCH --nodes=1
#SBATCH --exclusive

export OMP_NUM_THREADS=1
source $HOME/peft_ft/env_llama3_8B_peft/bin/activate

srun python3 "$HOME/peft_ft/consolidation.py" \
--input_dir "/fsx/ubuntu/peft_ft/model_checkpoints/checkpoint-1251" \
--output_dir "$HOME/peft_ft/model_checkpoints/adapter_shards_consolidation" \
--save_format "safetensors"
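The consolidation script above hard-codes the checkpoint-1251 directory. If your run produces a different final step, a small helper like the following could locate the most recent checkpoint instead. It is a sketch under assumptions: it expects checkpoint directories named checkpoint-<step>, as in the path above, and the helper name is ours. The demonstration uses a temporary directory in place of /fsx.

```python
import re
import tempfile
from pathlib import Path

def latest_checkpoint(checkpoint_root):
    """Return the checkpoint-<step> directory with the highest step, or None."""
    best_step, best_path = -1, None
    for path in Path(checkpoint_root).iterdir():
        match = re.fullmatch(r"checkpoint-(\d+)", path.name)
        # Compare steps numerically; a plain string sort would rank 800 above 1251.
        if path.is_dir() and match and int(match.group(1)) > best_step:
            best_step, best_path = int(match.group(1)), path
    return best_path

# Demonstrate on a throwaway directory instead of the shared file system.
with tempfile.TemporaryDirectory() as root:
    for name in ("checkpoint-400", "checkpoint-1251", "checkpoint-800", "final"):
        (Path(root) / name).mkdir()
    newest = latest_checkpoint(root).name
print(newest)
```

You could then pass the returned path as the --input_dir argument of consolidation.py.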

After consolidation is complete, you need to merge the LoRA adapter weights from the consolidated files with the base model’s weights. Begin by creating a new Python file with touch merge_lora.py and a shell file with touch merge_lora.sh, using the following code.

import json
from peft import LoraConfig, PeftModel
from transformers import AutoModelForCausalLM
import torch
import argparse
from safetensors import safe_open


def merge_lora_weights(args):
    base_model = AutoModelForCausalLM.from_pretrained(args.base_model_path)
    with open(args.adapter_config_path, "r") as f:
        config_dict = json.load(f)
    peft_config = LoraConfig(**config_dict)
    model = PeftModel(base_model, peft_config)
    
    lora_weights_tensors = {}
    with safe_open(args.lora_safetensors_path, framework="pt", device="cpu") as f:
        for k in f.keys():
            lora_weights_tensors[k] = f.get_tensor(k)
            
    for layer_name in list(lora_weights_tensors):
        if 'layer' in layer_name and 'lora' in layer_name:
            new_layer_name = layer_name.replace('weight', 'default.weight')
            lora_weights_tensors[new_layer_name] = lora_weights_tensors[layer_name].clone()
            del lora_weights_tensors[layer_name]
        else:
            del lora_weights_tensors[layer_name]

    updated_state_dict = model.state_dict().copy()
    for layer, weights in lora_weights_tensors.items():
        updated_state_dict[layer] = weights
    model.load_state_dict(updated_state_dict)    
    merged_model = model.merge_and_unload()    
    merged_model.save_pretrained(args.final_model_path, safe_serialization=True, max_shard_size="5GB")

if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("--final_model_path", type=str)
    parser.add_argument("--adapter_config_path", type=str)
    parser.add_argument("--base_model_path", type=str)
    parser.add_argument("--lora_safetensors_path", type=str)
    args = parser.parse_args()
    merge_lora_weights(args)

#!/bin/bash
#SBATCH --nodes=1
#SBATCH --exclusive
#SBATCH --output=/fsx/ubuntu/peft_ft/logs/8b/lora_weights.log

export OMP_NUM_THREADS=1
source $HOME/peft_ft/env_llama3_8B_peft/bin/activate

srun python3 "$HOME/peft_ft/merge_lora.py" \
    --final_model_path "/fsx/ubuntu/peft_ft/model_checkpoints/final_model_output" \
    --adapter_config_path "/fsx/ubuntu/peft_ft/model_checkpoints/checkpoint-1251/adapter_config.json" \
    --base_model_path "/fsx/ubuntu/peft_ft/model_artifacts/llama3-8B" \
    --lora_safetensors_path "/fsx/ubuntu/peft_ft/model_checkpoints/adapter_shards_consolidation/model.safetensors"

Trigger the run with sbatch merge_lora.sh to merge the model weights. Here, the base_model_path parameter is the local directory where you previously downloaded the model from Hugging Face in the model download step of “Model compilation and fine tuning.” Similarly, the adapter_config_path parameter is the adapter_config.json file saved in your fine-tuning checkpoint directory, and the lora_safetensors_path parameter is the path to the model.safetensors file output by the LoRA consolidation in the previous step.

Inference

After consolidation and merging, the safetensors files will be saved to your final_model_path output directory containing the updated model weights after fine tuning. Using these updated weights, you can load and generate a prediction for your trained model in the context of the dolly dataset. To check that the fine-tuned model understands the databricks-dolly-15k dataset it was fine tuned on, select a question from the dataset for validation, as shown in the following figure.

Hugging Face databricks-dolly-15k dataset card

Using Hugging Face’s LlamaForCausalLM class you can load your newly fine-tuned model, and generate a prediction for the question, “Who are the Smiths?” (shown in the following figure):

Model inference generation

Comparing the generated answer to the ground truth context and response from the training dataset, it’s clear that the fine-tuned Meta Llama 3 model now understands this data and can give coherent responses to posed questions.
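When you query the fine-tuned model, it can help to present the question in the same delimited layout that was used during training. The post does not show the exact prompt used for the preceding figure, so the following helper is an assumption: it reuses the section headers from the format function and leaves the answer section open for the model to complete.

```python
def build_inference_prompt(instruction, context=None):
    """Format a question like the training examples, leaving the answer open."""
    parts = [f"### Instruction\n{instruction}"]
    if context:
        parts.append(f"### Context\n{context}")
    # End with the answer header so that generation continues from this point.
    parts.append("### Answer\n")
    return "\n\n".join(parts)

prompt = build_inference_prompt("Who are the Smiths?")
print(prompt)
```

The resulting string can be tokenized and passed to the generate method of the model that you loaded from your final_model_path directory.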

Results

Technique	Trainable parameters	Samples processed per second	Training time (minutes)
FPFT	7,570,591,744	2.083	90
PEFT	6,815,744	3.554	53

To benchmark the fine-tuned model’s performance with LoRA on a single ml.trn1.32xlarge, we compared it to full parameter fine tuning (FPFT) for the model over three training epochs. Training samples processed per second increased by about 70% with LoRA (3.554 compared to 2.083), and training time dropped from 90 to 53 minutes, a reduction of about 41%. As a result, the on-demand instance hours required to fine tune the model on the dolly 15k dataset for three epochs were halved compared to FPFT, resulting in a 50% reduction of training costs.
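The figures in the results table can be cross-checked with a few lines of arithmetic. The throughput and time values below are copied from the table. The model dimensions used for the LoRA parameter count are assumptions based on the publicly documented Llama 3 8B configuration (hidden size 4,096, 32 layers, and a 1,024-wide value projection from grouped-query attention), combined with the rank and target modules from the LoraConfig in the training script.

```python
# Figures copied from the results table.
fpft = {"samples_per_sec": 2.083, "minutes": 90}
peft = {"samples_per_sec": 3.554, "minutes": 53}

throughput_gain_pct = (peft["samples_per_sec"] / fpft["samples_per_sec"] - 1) * 100
time_reduction_pct = (1 - peft["minutes"] / fpft["minutes"]) * 100

# LoRA parameter count. These dimensions are assumptions taken from the
# public Llama 3 8B configuration, not from this post.
hidden_size = 4096
num_layers = 32
kv_dim = 1024   # 8 key/value heads x 128 head dimension
rank = 16       # r in the LoraConfig of the training script

# Each adapted matrix of shape (d_out, d_in) adds rank * (d_in + d_out) parameters.
q_proj_params = rank * (hidden_size + hidden_size)
v_proj_params = rank * (hidden_size + kv_dim)
lora_trainable = num_layers * (q_proj_params + v_proj_params)

print(f"{throughput_gain_pct:.1f}% higher throughput")
print(f"{time_reduction_pct:.1f}% shorter training time")
print(f"{lora_trainable:,} trainable LoRA parameters")
```

Under these assumptions the computed LoRA parameter count matches the 6,815,744 trainable parameters reported for PEFT in the table.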

Clean up

To clean up the resources provisioned for this post, first delete the SageMaker HyperPod cluster. This can be done either through the AWS CLI or in the SageMaker console.

aws sagemaker delete-cluster --cluster-name ml-cluster

After the cluster is deleted, delete the CloudFormation stack to delete the remaining provisioned resources.

aws cloudformation delete-stack --stack-name sagemaker-hyperpod

Conclusion

In this post, we showed you how to set up a SageMaker HyperPod compute cluster for training. Then we showed you how to perform distributed fine tuning with Trainium for a Meta Llama 3 model using LoRA. Finally, we showed you how to consolidate model weights across a distributed training environment to generate coherent predictions for the newly fine-tuned model.

About the Authors

Georgios Ioannides is a Deep Learning Architect with the AWS Generative AI Innovation Center. Before AWS, Georgios worked in startups, where he specialized in signal processing, deep learning, and multi-modal and cross-modal machine learning systems for speech, vision, and text applications. He holds Master’s degrees from Imperial College London and Carnegie Mellon University.

Bingchen Liu is a Machine Learning Engineer with the AWS Generative AI Innovation Center. Before AWS, he worked as a lead MLE in ADP focusing on RAG applications, vector database, model development, and serving. He holds a Master’s degree in Computer Science from Columbia University and a PhD in Statistics from Southern Methodist University.

Hannah Marlowe is a Senior Manager of Model Customization at the AWS Generative AI Innovation Center. Her team specializes in helping customers develop differentiating generative AI solutions using their unique and proprietary data to achieve key business outcomes. She holds a PhD in Physics from the University of Iowa, with a focus on astronomical X-ray analysis and instrumentation development. Outside of work, she can be found hiking, mountain biking, and skiing around the mountains in Colorado.

Jeremy Roghair is a Machine Learning Engineer with the AWS Generative AI Innovation Center, where he focuses on developing generative AI solutions for distributed training workloads and model hosting for customers. Prior to joining AWS, Jeremy worked as a Data Scientist in the finance/insurance industry and earned a Master’s degree in Computer Science with research in reinforcement learning from Iowa State University.