This is my first post, about my experience trying to fine-tune Llama-3-8B-Instruct with QLoRA on a publicly available dataset. It will not cover the basics of Large Language Model (LLM) fine-tuning, quantization, or LoRA, and it will not provide an evaluation method or metric for assessing the performance of the fine-tuned model.
Many AI enthusiasts, myself included, don't have the luxury of hardware resources, locally or on the popular cloud platforms, to fine-tune the latest LLMs in quantized form. So I set out to find what options are available and to estimate how much it would cost to fine-tune a model.
The best free resource for these experiments is Kaggle notebooks. They offer two T4 GPUs for 30 hours per week, 30 GB of RAM, and enough disk space, all for free. With this configuration it is possible to load the Llama-3-8B-Instruct model in bfloat16 and run inference. The response speed per query is not impressive, but it is good enough. You need to create Hugging Face access tokens (read and write), provide them to the scripts, and accept the model's terms and conditions before downloading Llama 3.
!pip install huggingface_hub

from huggingface_hub import notebook_login
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

notebook_login()  # paste your Hugging Face read token here

model_id = "meta-llama/Meta-Llama-3-8B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",  # spread the model across both T4 GPUs
)
messages = [
    {"role": "system", "content": "You are a pirate chatbot who always responds in pirate speak!"},
    {"role": "user", "content": "Who are you?"},
]
input_ids = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    return_tensors="pt"
).to(model.device)

terminators = [
    tokenizer.eos_token_id,
    tokenizer.convert_tokens_to_ids("<|eot_id|>")
]

# Disable the memory-efficient and flash SDP backends (workaround, see link below)
torch.backends.cuda.enable_mem_efficient_sdp(False)
torch.backends.cuda.enable_flash_sdp(False)
# https://github.com/Lightning-AI/litgpt/issues/327

outputs = model.generate(
    input_ids,
    max_new_tokens=128,
    eos_token_id=terminators,
    do_sample=True,
    temperature=0.6,
    top_p=0.9,
)
response = outputs[0][input_ids.shape[-1]:]
print(tokenizer.decode(response, skip_special_tokens=True))
Arrrr, shiver me timbers! Me name be Captain Chatbot, the scurviest chatbot to ever sail the Seven Seas o' the Interwebs!
Me and me trusty crew o' code have been plunderin' the high seas o' conversation fer years, bringin' treasure troves o' knowledge and witty banter to all ye landlubbers!
So hoist the colors, me hearty, and let's set sail fer a swashbucklin' good time!
To load the model quantized in 8-bit or 4-bit format, the model loading part should be changed as follows:
!pip install -U transformers[torch] datasets
!pip install -q bitsandbytes trl peft accelerate
!pip install flash-attn --no-build-isolation

from transformers import BitsAndBytesConfig

# For 8-bit quantization
# quantization_config = BitsAndBytesConfig(load_in_8bit=True,
#                                          llm_int8_threshold=200.0)

# For 4-bit quantization
quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=quantization_config,
    device_map="auto",
)
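A quick way to check how much memory the quantized model actually occupies on each card is to query the CUDA allocator per device (a minimal sketch; the numbers will vary from run to run):

# Report allocated and reserved memory on each visible GPU
for i in range(torch.cuda.device_count()):
    allocated = torch.cuda.memory_allocated(i) / 1024**3
    reserved = torch.cuda.memory_reserved(i) / 1024**3
    print(f"GPU {i}: {allocated:.1f} GB allocated, {reserved:.1f} GB reserved")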
I was pleasantly surprised to see the 4-bit model take only about 5.9 GB of the 30 GB available across the two T4 GPUs, and I confidently proceeded to fine-tune the model. I used the publicly available ultrachat_200k dataset. You can find the details of the training notebook here. It is based on the Supervised Fine-Tuning Trainer (SFTTrainer) from the trl library and is a demo script for testing on a small section of the dataset.
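The trainer below also needs a train_dataset and eval_dataset with a "text" column. Roughly, the preparation could look like this (a sketch, assuming the HuggingFaceH4/ultrachat_200k splits and a small 500-sample slice; the exact preprocessing in the notebook may differ):

from datasets import load_dataset

# Load a small slice of ultrachat_200k for a quick demo run
raw_train = load_dataset("HuggingFaceH4/ultrachat_200k", split="train_sft[:500]")
raw_eval = load_dataset("HuggingFaceH4/ultrachat_200k", split="test_sft[:50]")

def to_text(example):
    # Render the list of chat messages into a single prompt string
    return {"text": tokenizer.apply_chat_template(example["messages"], tokenize=False)}

train_dataset = raw_train.map(to_text)
eval_dataset = raw_eval.map(to_text)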
model_id = "meta-llama/Meta-Llama-3-8B-Instruct"
trained_model_id = "Llama-3-8B-sft-lora-ultrachat"
output_dir = 'kaggle/working/' + trained_model_id# based on config
training_args = TrainingArguments(
    fp16=False,  # set bf16=True on GPUs that support bf16, otherwise fp16=True
    bf16=False,
    do_eval=True,
    evaluation_strategy="epoch",
    gradient_accumulation_steps=1,
    gradient_checkpointing=True,
    gradient_checkpointing_kwargs={"use_reentrant": False},
    learning_rate=2.0e-05,
    log_level="info",
    logging_steps=5,
    logging_strategy="steps",
    lr_scheduler_type="cosine",
    max_steps=-1,
    num_train_epochs=1,
    output_dir=output_dir,
    overwrite_output_dir=True,
    per_device_eval_batch_size=1,   # originally set to 8
    per_device_train_batch_size=1,  # originally set to 8
    push_to_hub=True,
    hub_model_id=trained_model_id,
    # hub_strategy="every_save",
    # report_to="tensorboard",
    report_to="none",  # skip wandb logging
    save_strategy="no",
    save_total_limit=None,
    seed=42,
)
# based on config
peft_config = LoraConfig(
    r=64,
    lora_alpha=16,
    lora_dropout=0.1,
    bias="none",
    task_type="CAUSAL_LM",
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
)
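The trainer below is given the model id as a string together with model_init_kwargs, which the notebook defines elsewhere. A minimal sketch of what they could look like, assuming the same 4-bit configuration used above (the actual notebook may configure them differently):

# Hypothetical model_init_kwargs mirroring the 4-bit loading shown earlier
model_kwargs = dict(
    torch_dtype=torch.bfloat16,
    quantization_config=quantization_config,
    device_map="auto",
)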
trainer = SFTTrainer(
    model=model_id,
    model_init_kwargs=model_kwargs,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    dataset_text_field="text",
    tokenizer=tokenizer,
    packing=True,
    peft_config=peft_config,
    max_seq_length=tokenizer.model_max_length,
)
# Clear out the CUDA cache from an unsuccessful run
torch.cuda.empty_cache()
train_result = trainer.train()
Unfortunately, the training didn't succeed: I kept getting out-of-memory errors despite taking only 500 samples of data and setting gradient_accumulation_steps, per_device_train_batch_size, and per_device_eval_batch_size to their minimum. In particular, the memory of one of the T4 GPUs gets maxed out while the other still has a fair amount free. The device_map was set to "auto", but maybe there are other techniques that could resolve the issue that I am unaware of. Kaggle also offers a P100 GPU, but it has only 16 GB of memory, clearly not enough for this fine-tuning task.
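One thing that might help, which I have not verified on Kaggle, is capping the per-GPU memory when loading the model so that the layers are balanced more evenly across the two T4s. A sketch, with the caps chosen as an assumption:

# Cap how much of each T4 the model may use so layers (and leftover room
# for activations) are spread more evenly; the values here are guesses.
max_memory = {0: "10GiB", 1: "10GiB", "cpu": "20GiB"}

model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=quantization_config,
    device_map="auto",
    max_memory=max_memory,
)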
The other viable option at this point is a Google Colab notebook. It offers an A100 GPU, which should be more than enough for this task, but compute units need to be purchased. For 14 CAD you can obtain 100 compute units, and an A100 GPU (80 GB memory) typically consumes about 15 compute units per hour. So 14 CAD effectively buys about 7 hours, roughly 2 CAD per hour of usage, though this is a rough estimate and can go higher depending on many other things. You can upload the training notebook and try the fine-tuning there. Of course, with an A100 one could even do full model training on the full dataset, provided cost is not an issue. They have also added a new GPU type, the L4 (24 GB memory); I am not sure how many compute units it consumes per hour, but training will definitely take longer than on an A100.
Lastly, I came across another option: beam.cloud, which provides a serverless infrastructure platform for model training and hosting. They offer an A10 GPU (24 GB memory) that can effectively fine-tune a Llama-3-8B model with 4-bit QLoRA. You need to create an account on beam.cloud and add payment information, and you get a 10-hour free trial of their service. Other GPU options (A100 and L4) are available, but I haven't tried them. Installing their SDK is straightforward, and I didn't run into resource-availability issues while using their service. You can try fine-tuning a model using the script provided here.
Finally, once model training has finished and the LoRA adapters have been uploaded to Hugging Face, it is possible to run inference on Kaggle. A sample inference notebook is provided here. The next steps would be to fine-tune on the full dataset, compute perplexity, vary the LoRA rank and the TrainingArguments parameters, and check what gives the best results on the dataset.
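Loading the quantized base model and attaching the uploaded adapters for inference looks roughly like this (a sketch; the adapter repo id is whatever you pushed to the Hub, and the username below is a placeholder):

from peft import PeftModel

# Load the 4-bit base model as before, then attach the LoRA adapters
base_model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=quantization_config,
    device_map="auto",
)
# "your-username/Llama-3-8B-sft-lora-ultrachat" is a placeholder repo id
model = PeftModel.from_pretrained(base_model, "your-username/Llama-3-8B-sft-lora-ultrachat")
model.eval()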
To summarize:
- A Kaggle notebook can run inference, but not 4-bit QLoRA fine-tuning, even with two T4 GPUs
- Google Colab can run 4-bit QLoRA fine-tuning, or even full-model training, on A100 GPUs after purchasing compute units
- Beam.cloud can run QLoRA fine-tuning on A10 GPUs at a reasonable cost