Fine Tuning LLM for Parsing and Serving Through Ollama

By Kaushik Holla · May 2025



Last week, I signed up for the Databricks conference happening in San Francisco, eager to explore new AI innovations. While reviewing the event schedule, one particular Lightning Talk by Mastercard caught my attention — Generative AI Merchant Matching. This intriguing session highlighted an affordable, yet highly effective AI system developed to translate confusing merchant identifiers found on credit card statements (like “SQ STARBUCKS#123SEATTLEWA”) into recognizable business names (such as “Starbucks Coffee Company”).

Their strategy was thoughtfully structured into three distinct stages: First, fine-tuning a Llama 3 8B model to accurately extract critical details from the ambiguous merchant strings, including the name and location. Next, applying a hybrid search approach that leverages these extracted details to efficiently query a comprehensive database of known merchants. Lastly, deploying a Llama 3 70B model that meticulously evaluates the top candidate matches, incorporating an ‘AI judge’ to verify results and minimize errors from hallucinations.

What stood out was Mastercard’s remarkable achievement — a 400% latency reduction coupled with impressive accuracy, all accomplished at minimal expense. Remarkably, they pointed out that fine-tuning each model iteration cost merely a few hundred dollars, demonstrating that even small teams can tackle significant challenges using AI.

Inspired by this project, I couldn’t wait for the workshop, so I decided to just dive in and try building it myself.

The overall goal of this project is to build an AI-powered system that can automatically match those tricky merchant names from the financial records to clean, standardized business names.

I broke the problem down into three stages:

  1. Stage 1: Fine-tuning the Model — This is what this blog post is all about! Teaching the LLM to understand and break down the messy merchant text.
  2. Stage 2: Search — Using the clean, parsed information to efficiently find potential matches in a database of known businesses. I will be using Elasticsearch and FAISS for this.
  3. Stage 3: Re-ranking — Using another, potentially larger LLM to evaluate the search results and pick the best match, ensuring high accuracy. The plan is to use Deepseek-R1.

Before tackling all three stages at once, I will focus specifically on that exciting first stage: fine-tuning the LLM for merchant data parsing!

Before fine-tuning, you need examples to teach the model! So, I started by looking for similar datasets on Kaggle and found some helpful ones to get the ball rolling and understand the structure of the merchant data.

But to really make the model good at handling this specific kind of messiness, I systematically added variations of noise — asterisks, location codes, random numbers, abbreviations — to mimic real-world merchant descriptors. This synthetic data was the key to getting enough relevant examples.
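To give a flavor of what that noise injection looks like, here is a minimal sketch. The helper name, the noise patterns, and the output path are illustrative, not the exact script used for this project:

import json
import random

# Illustrative noise: processor prefixes, crude abbreviations, store numbers, location codes
def add_noise(name, city, state):
    prefix = random.choice(["SQ *", "TST* ", "PAYPAL *", ""])
    abbrev = name.upper().replace(" ", "")[:12]
    store_no = f"#{random.randint(1, 999)}"
    return f"{prefix}{abbrev}{store_no}{city.upper()}{state.upper()}"

# Write prompt/completion pairs in JSON Lines format for fine-tuning
with open("parser_train.jsonl", "w") as f:
    for name, city, state in [("Starbucks Coffee Company", "Seattle", "WA")]:
        record = {
            "prompt": f"Parse this merchant descriptor: {add_noise(name, city, state)}\n",
            "completion": json.dumps({"merchant": name, "city": city, "state": state}),
        }
        f.write(json.dumps(record) + "\n")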

Fine-tuning is like taking a brilliant generalized LLM and teaching it a very specific, expert skill for your particular task. It customizes the model’s understanding and behavior for your domain. This means:

  • Better accuracy on your specific type of data.
  • Less chance of it giving you weird or irrelevant answers.
  • Improved performance on domain-specific terms or abbreviations.
  • Less chance of deviating from specified output format.

Simply put, fine-tuning helps the model truly get the unique language and structure of your merchant data.

Here’s a simplified look at how I tackled the fine-tuning part. I will be using Python, Hugging Face libraries, and LoRA (Low-Rank Adaptation).

Step 1: Setting Up The Workspace

You will need the following libraries to set up the workspace.

pip install transformers datasets peft torch
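Depending on your transformers version, the Trainer may also require the accelerate package; if you hit an import error during training, install it as well:

pip install accelerate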

Step 2: Loading The Dataset

Load the synthetic data that was generated. The training data is in JSON Lines (.jsonl) format, which makes it easy to load using the Hugging Face datasets library.

from transformers import AutoModelForCausalLM, AutoTokenizer, Trainer, TrainingArguments
from peft import LoraConfig, get_peft_model, TaskType
from datasets import load_dataset

# Loading the training data
dataset = load_dataset("json", data_files="../Data/parser_train.jsonl")
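It is worth sanity-checking one record before tokenizing. The field values below are illustrative, but the tokenization step that follows assumes each record has prompt and completion keys:

# Peek at the first training example
print(dataset["train"][0])
# e.g. {'prompt': 'Parse this merchant descriptor: SQ *STARBUCKSCO#123SEATTLEWA\n',
#       'completion': '{"merchant": "Starbucks Coffee Company", "city": "Seattle", "state": "WA"}'}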

Step 3: Tokenizing the Text

LLMs don’t read words; they process numbers (tokens). So, you need to break your text data down into tokens using a tokenizer compatible with the chosen Llama model.

from transformers import AutoTokenizer

model_name = "meta-llama/Llama-3.2-1B"
tokenizer = AutoTokenizer.from_pretrained(model_name, use_fast=False)

# Ensure the tokenizer has a padding token
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token

# Tokenization
def tokenize(batch):
    # Combine prompt and completion for Causal LM training
    # Assumes your dataset has 'prompt' and 'completion' columns
    text = [p + c for p, c in zip(batch["prompt"], batch["completion"])]
    # Tokenize the combined text
    tokenized = tokenizer(text, truncation=True, padding="max_length", max_length=512)
    # In causal LM, labels are the same as input_ids
    tokenized["labels"] = tokenized["input_ids"].copy()
    return tokenized

# Apply tokenization to the 'train' split and remove original text columns
original_columns = dataset['train'].column_names
dataset = dataset['train'].map(
    tokenize,
    batched=True,
    remove_columns=original_columns  # Remove 'prompt', 'completion', etc.
)

Step 4: Setting Up LoRA and Loading the Model

This is where the magic of LoRA comes in! Instead of modifying billions of parameters in the full model, LoRA injects tiny, trainable matrices into key layers. This drastically reduces the number of parameters you need to train, saving memory and time.

# Load model directly onto MPS device before applying PEFT
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    device_map="mps"  # Explicitly load to MPS
)

# Configure LoRA
config = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    inference_mode=False,
    r=8, lora_alpha=32, lora_dropout=0.05
)

model = get_peft_model(model, config)
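To see how few parameters are actually being trained after wrapping the model, PEFT can print a quick summary (the exact counts depend on the base model and the LoRA rank):

# Report trainable vs. total parameter counts -- typically a tiny fraction of the full model
model.print_trainable_parameters()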

Step 5: Training Time!

Now for the actual training loop. Hugging Face’s Trainer makes this relatively straightforward.

# Training arguments
training_args = TrainingArguments(
    output_dir="./adapter",
    per_device_train_batch_size=2,
    num_train_epochs=1,
    logging_steps=100,
    save_total_limit=1
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=dataset
)
trainer.train()

Common Pitfall Alert!

Out-of-Memory (OOM) Errors: This is probably the most common issue when fine-tuning LLMs! It means your GPU or accelerator memory ran out of space.

  • Solution: Lower the batch size by reducing per_device_train_batch_size; if that alone is not enough, combine it with gradient accumulation (see the sketch below).
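A minimal tweak to the earlier TrainingArguments keeps the same effective batch size while halving the memory used per step:

# Smaller per-step batch, compensated with gradient accumulation
training_args = TrainingArguments(
    output_dir="./adapter",
    per_device_train_batch_size=1,   # halve the memory used per step
    gradient_accumulation_steps=2,   # keep the effective batch size at 2
    num_train_epochs=1,
    logging_steps=100,
    save_total_limit=1
)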

Step 6: Saving Your Fine-Tuned Model & Serving with Ollama

The model has now learnt its new skill. Next, save those LoRA adapters and get them ready to use. Ollama is a fantastic tool for running LLMs locally, and you can easily serve the fine-tuned model with it.

model.save_pretrained("./adapter")
tokenizer.save_pretrained("./adapter")

Step 6.1: Convert LoRA Adapter to GGUF Format

Ollama works great with models and adapters in the GGUF format. You need to convert the saved Hugging Face PEFT adapter into this format. A common way to do this is with the convert_lora_to_gguf.py script that ships with the llama.cpp project (clone llama.cpp if you haven't already). Note that the script's location and flags have shifted between llama.cpp versions, so check its --help output for your checkout.

# (If you haven’t already) clone llama.cpp so you have the conversion script:
git clone https://github.com/ggerganov/llama.cpp.git

# From your project root, run:
python llama.cpp/scripts/convert_lora_to_gguf.py \
    --adapter-dir ./adapter \
    --outfile merchant-parser-adapter.gguf

This command will generate a file named merchant-parser-adapter.gguf. The fine-tuned adapter is now ready to plug into Ollama!

Step 6.2: Writing Ollama Modelfile

Combine the Llama 3.2 base model (which you should already have installed in Ollama, e.g., by running ollama run llama3.2:1b) with the new GGUF adapter file. In the same folder where you saved merchant-parser-adapter.gguf, create a new file named Modelfile.

# Modelfile for your fine-tuned merchant parser
# Start from the base Llama-3.2 1B Instruct model
FROM llama3.2:1b

# Point to converted GGUF adapter:
ADAPTER ./merchant-parser-adapter.gguf

# System prompt for parsing tasks:
SYSTEM "You are an expert at parsing raw merchant transaction descriptors into structured data."

Step 6.3: Building the Ollama Model

Now, use the Ollama command-line tool to build this new model based on your Modelfile. Open your terminal in the directory containing your Modelfile and the .gguf adapter file.

# From the same directory containing Modelfile & merchant-parser-adapter.gguf:
ollama create llama-merchant-parser -f Modelfile

Ollama will package the base model with the adapter. This might take a minute or two. Your custom fine-tuned model is now ready within Ollama!
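You can smoke-test it straight from the command line. The output below is purely illustrative; what your adapter actually produces depends on how the training completions were formatted:

# Quick test of the fine-tuned parser
ollama run llama-merchant-parser "SQ STARBUCKS#123SEATTLEWA"
# Illustrative output: {"merchant": "Starbucks Coffee Company", "city": "Seattle", "state": "WA"}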

We fine-tuned a Llama 3.2 model, converted the adapter to GGUF, and got it running in Ollama for inference on merchant data. This is the core merchant data parsing engine!

This step is crucial for turning that messy input into clean, structured information. This is just the first piece of the puzzle. The project involves two more key stages to achieve full merchant matching:

  1. The Search: Efficiently finding candidate matches in a database using the parsed data.
  2. The Re-ranking: Using a larger model or other methods to select the best possible match from the candidates.

Stay tuned for my next blog posts, where I will share my experience building the next stages of this cost-effective AI system. I will also post the complete GitHub link to the project along with the generated synthetic data.
