Why Customize LLMs?
Large Language Models (LLMs) are deep learning models pre-trained with self-supervised learning; they require vast amounts of training data, training time and compute, and contain a very large number of parameters. LLMs have revolutionized natural language processing, especially in the last two years, demonstrating remarkable capabilities in understanding and generating human-like text. However, these general-purpose models' out-of-the-box performance may not always meet specific business needs or domain requirements. LLMs alone cannot answer questions that rely on proprietary company data or closed-book settings, making them relatively generic in their applications. Training an LLM from scratch is largely infeasible for small to medium teams due to the demand for massive amounts of training data and resources. Therefore, a wide range of LLM customization strategies have been developed in recent years to tune the models for various scenarios that require specialized knowledge.
The customization strategies can be broadly split into two types:
- Using a frozen model: These techniques don't require updating model parameters and are typically accomplished through in-context learning or prompt engineering. They are cost-effective since they alter the model's behavior without incurring extensive training costs, and are therefore widely explored in both industry and academia, with new research papers published on a daily basis.
- Updating model parameters: This is a relatively resource-intensive approach that requires tuning a pre-trained LLM using custom datasets designed for the intended purpose. This includes popular techniques like Fine-Tuning and Reinforcement Learning from Human Feedback (RLHF).
These two broad customization paradigms branch out into various specialized techniques including LoRA fine-tuning, Chain of Thought, Retrieval Augmented Generation, ReAct, and Agent frameworks. Each technique offers distinct advantages and trade-offs regarding computational resources, implementation complexity, and performance improvements.
How to Choose LLMs?
The first step in customizing an LLM is to select an appropriate foundation model as the baseline. Community-based platforms such as Hugging Face offer a wide range of open-source pre-trained models contributed by top companies and communities, such as the Llama series from Meta and Gemma from Google. Hugging Face additionally provides leaderboards, for example the "Open LLM Leaderboard", to compare LLMs based on industry-standard metrics and tasks (e.g. MMLU). Cloud providers (e.g. AWS) and AI companies (e.g. OpenAI and Anthropic) also offer access to proprietary models, typically as paid services with restricted access. The following factors are essential to consider when choosing an LLM.
- Open source or proprietary model: Open-source models allow full customization and self-hosting but require technical expertise, while proprietary models offer immediate access and often better quality responses but at higher cost.
- Task and metrics: Models excel at different tasks, including question-answering, summarization, code generation, etc. Compare benchmark metrics and test on domain-specific tasks to determine the appropriate model.
- Architecture: In general, decoder-only models (GPT series) perform better at text generation, while encoder-decoder models (T5) handle translation well. More architectures are emerging and showing promising results, for instance DeepSeek's Mixture of Experts (MoE) models.
- Number of Parameters and Size: Larger models (70B-175B parameters) offer better performance but need more computing power. Smaller models (7B-13B) run faster and cheaper but may have reduced capabilities.
After determining a base LLM, let's explore the six most common strategies for LLM customization, ranked from the least to the most resource-intensive:
- Prompt Engineering
- Decoding and Sampling Strategy
- Retrieval Augmented Generation
- Agent
- Fine Tuning
- Reinforcement Learning from Human Feedback
If you’d prefer a video walkthrough of these concepts, please check out my video on “6 Common LLM Customization Strategies Briefly Explained”.
LLM Customization Techniques
1. Prompt Engineering
A prompt is the input text sent to an LLM to elicit an AI-generated response. It can be composed of instructions, context, input data and an output indicator.
- Instructions: This provides a task description or instruction for how the model should perform.
- Context: This is external information to guide the model to respond within a certain scope.
- Input data: This is the input for which you want a response.
- Output indicator: This specifies the output type or format.
Prompt Engineering involves crafting these prompt components strategically to shape and control the model's response. Basic prompt engineering techniques include zero-shot, one-shot, and few-shot prompting. Users can apply these techniques directly while interacting with the LLM, making prompt engineering an efficient approach to align a model's behavior with a novel objective. API implementation is also an option, and more details are introduced in my previous article "A Simple Pipeline for Integrating LLM Prompt with Knowledge Graph".
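As a minimal illustration (the model name, labels and example reviews below are hypothetical), the snippet assembles a few-shot prompt out of the four components above and sends it to a chat model via the OpenAI Python client:

from openai import OpenAI

client = OpenAI()  # assumes an OpenAI API key is configured in the environment

# few-shot prompt: instruction + examples (context) + input data + output indicator
prompt = (
    "Classify the sentiment of the review as Positive or Negative.\n"   # instruction
    "Review: The battery lasts all day. Sentiment: Positive\n"          # example 1
    "Review: The screen cracked within a week. Sentiment: Negative\n"   # example 2
    "Review: Setup was quick and painless.\n"                           # input data
    "Sentiment:"                                                        # output indicator
)

response = client.chat.completions.create(
    model="gpt-3.5-turbo",  # illustrative model choice
    messages=[{"role": "user", "content": prompt}],
)
print(response.choices[0].message.content)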
Due to the efficiency and effectiveness of prompt engineering, more complex approaches have been explored and developed to advance the logical structure of prompts.
Chain of Thought (CoT) asks LLMs to break down complex reasoning tasks into step-by-step thought processes, improving performance on multi-step problems. Each step explicitly exposes its reasoning outcome, which serves as context for the subsequent steps until arriving at the final answer.
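As a simple sketch (the question and trigger phrase are illustrative), a zero-shot CoT prompt only appends a reasoning trigger to the query:

# zero-shot Chain of Thought: append a reasoning trigger to the question
question = "A cafe sold 14 coffees in the morning and twice as many in the afternoon. How many in total?"
cot_prompt = f"{question}\nLet's think step by step."
# the model is expected to surface intermediate steps
# (14 in the morning + 28 in the afternoon) before the final answer, 42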
Tree of Thoughts (ToT) extends CoT by considering multiple reasoning branches and self-evaluating choices to decide the next best action. It is more effective for tasks that involve initial decisions, strategies for the future and exploration of multiple solutions.
Automatic Reasoning and Tool-use (ART) builds upon the CoT process: it deconstructs complex tasks and allows the model to select few-shot examples from a task library and use predefined external tools like search and code generation.
Synergizing Reasoning and Acting (ReAct) combines reasoning trajectories with an action space, where the model searches through the action space and determines the next best action based on environmental observations.
Techniques like CoT and ReAct are often combined with an agentic workflow to strengthen their capabilities. These techniques will be introduced in more detail in the "Agent" section.
Further Reading
2. Decoding and Sampling Strategy

Decoding strategy can be controlled at model inference time through inference parameters (e.g. temperature, top p, top k), determining the randomness and diversity of model responses. Greedy search, beam search and sampling are three common decoding strategies for auto-regressive model generation.
During the autoregressive generation process, the LLM outputs one token at a time based on a probability distribution over candidate tokens conditioned on the previous tokens. By default, greedy search is applied to produce the next token with the highest probability.
In contrast, beam search decoding considers multiple hypotheses of next-best tokens and selects the hypothesis with the highest combined probability across all tokens in the text sequence. The code snippet below uses the transformers library to specify the number of beam paths (e.g. num_beams=5 considers 5 distinct hypotheses) during the model generation process.
from transformers import AutoModelForCausalLM, AutoTokenizer

# model_name, tokenizer_name and prompt are placeholders for your own choices
tokenizer = AutoTokenizer.from_pretrained(tokenizer_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

# beam search decoding, keeping 5 hypotheses at each step
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, num_beams=5)
Sampling strategy is the third approach to control the randomness of model responses by adjusting these inference parameters:
- Temperature: Lowering the temperature makes the probability distribution sharper, increasing the likelihood of generating high-probability words and decreasing the likelihood of generating low-probability words. When temperature = 0, sampling becomes equivalent to greedy search (least creative); when temperature = 1, the model samples from the unmodified distribution, and higher values flatten the distribution further to produce more diverse outputs (a minimal temperature example follows the sampling snippet below).
- Top K sampling: This method filters the K most probable next tokens and redistributes the probability among those tokens. The model then samples from this filtered set of tokens.
- Top P sampling: Instead of sampling from the K most probable tokens, top-p sampling selects from the smallest possible set of tokens whose cumulative probability exceeds the threshold p.
The example code snippet below restricts sampling to the 50 most likely tokens (top_k=50) and to the smallest set of tokens whose cumulative probability exceeds 0.95 (top_p=0.95), returning three candidate sequences.
sample_outputs = model.generate(
    **model_inputs,
    max_new_tokens=40,
    do_sample=True,
    top_k=50,
    top_p=0.95,
    num_return_sequences=3,
)
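Temperature is passed through the same generate interface. The minimal sketch below, which assumes the model and model_inputs defined above and uses an illustrative value of 0.7, rescales the distribution before sampling:

# lower temperature -> sharper distribution (more deterministic);
# higher temperature -> flatter distribution (more diverse outputs)
outputs = model.generate(
    **model_inputs,
    max_new_tokens=40,
    do_sample=True,
    temperature=0.7,  # illustrative value between near-greedy (close to 0) and unmodified sampling (1)
)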
Further Reading
3. RAG

Retrieval Augmented Generation (RAG), initially introduced in the paper "Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks", has been demonstrated as a promising solution that integrates external knowledge and reduces common LLM "hallucination" issues when handling domain-specific or specialized queries. RAG dynamically pulls relevant information from the knowledge base and generally does not involve extensive training to update LLM parameters, making it a cost-effective strategy to adapt a general-purpose LLM to a specialized domain.
A RAG system can be decomposed into a retrieval stage and a generation stage. The objective of the retrieval process is to find content within the knowledge base that is closely related to the user query, by chunking external knowledge, creating embeddings, indexing and running similarity search.
- Chunking: Documents are divided into smaller segments, with each segment containing a distinct unit of information.
- Create embeddings: An embedding model compresses each information chunk into a vector representation. The user query is also converted into its vector representation through the same vectorization process, so that the user query can be compared in the same dimensional space.
- Indexing: This process stores these text chunks and their vector embeddings as key-value pairs, enabling efficient and scalable search functionality. For large external knowledge bases that exceed memory capacity, vector databases offer efficient long-term storage.
- Similarity search: Similarity scores between the query embedding and the text chunk embeddings are calculated and used to retrieve the information most relevant to the user query (see the sketch after this list).
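To make the retrieval steps concrete, here is a minimal sketch of embedding and similarity search using the sentence-transformers library; the embedding model and toy chunks are illustrative assumptions:

from sentence_transformers import SentenceTransformer, util

embed_model = SentenceTransformer("BAAI/bge-small-en-v1.5")  # illustrative embedding model

chunks = [
    "LoRA fine-tunes a model by training low-rank weight updates.",
    "RAG retrieves external knowledge and adds it to the prompt at query time.",
]
query = "How does retrieval augmented generation work?"

# embed chunks and query into the same vector space, then score by cosine similarity
chunk_embeddings = embed_model.encode(chunks, convert_to_tensor=True)
query_embedding = embed_model.encode(query, convert_to_tensor=True)
scores = util.cos_sim(query_embedding, chunk_embeddings)

# the highest-scoring chunk is retrieved to augment the user query
best_chunk = chunks[int(scores.argmax())]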
The generation process of the RAG system then combines the retrieved information with the user query to form the augmented query, which is passed to the LLM to generate a context-rich response.
Code Snippet
The code snippet first specifies the LLM and embedding model, then merges the external knowledge base documents into a single document, creates the index from the document (which chunks, embeds and indexes it), defines the query_engine based on the index, and queries the query_engine with the user prompt.
from llama_index.core import Document, Settings, VectorStoreIndex
from llama_index.llms.openai import OpenAI
from llama_index.embeddings.huggingface import HuggingFaceEmbedding

# specify the LLM and the embedding model
Settings.llm = OpenAI(model="gpt-3.5-turbo")
Settings.embed_model = HuggingFaceEmbedding(model_name="BAAI/bge-small-en-v1.5")

# documents is assumed to be loaded beforehand (e.g. via SimpleDirectoryReader)
document = Document(text="\n\n".join([doc.text for doc in documents]))
index = VectorStoreIndex.from_documents([document])

query_engine = index.as_query_engine()
response = query_engine.query(
    "Tell me about LLM customization strategies."
)
The example above shows a simple RAG system. Advanced RAG improves on this by introducing pre-retrieval and post-retrieval strategies to reduce pitfalls such as limited synergy between the retrieval and generation processes. For example, the rerank technique reorders the retrieved information using a model capable of understanding bidirectional context, and integration with a knowledge graph enables advanced query routing. More use cases can be found on the LlamaIndex website.
Further Reading
4. Agent

LLM agents were a trending topic in 2024 and will likely remain a main focus in the GenAI field in 2025. Compared to RAG, agents excel at creating query routes and planning LLM-based workflows, with the following benefits:
- Maintaining memory and state of previous model generated responses.
- Leveraging various tools based on specific criteria. This tool-using capability sets agents apart from basic RAG systems by giving the LLM independent control over tool selection.
- Breaking down a complex task into smaller steps and planning for a sequence of actions.
- Collaborating with other agents to form an orchestrated system.
Several in-context learning techniques (e.g. CoT, ReAct) can be implemented through an agentic framework, and we will discuss ReAct in more detail. ReAct, which stands for "Synergizing Reasoning and Acting in Language Models", is composed of three key elements: actions, thoughts and observations. The framework was introduced by Google Research and Princeton University, and builds upon Chain of Thought by integrating the reasoning steps with an action space that enables tool use and function calling. Additionally, the ReAct framework emphasizes determining the next best action based on environmental observations.
This example from the original paper demonstrates ReAct's inner working process: the LLM generates the first thought and acts by calling the function "Search [Apple Remote]", then observes the feedback from its first output. The second thought is based on the previous observation, leading to a different action, "Search [Front Row]". This process iterates until the goal is reached. The research shows that, by interacting with a simple Wikipedia API, ReAct overcomes the hallucination and error propagation issues more often observed in chain-of-thought reasoning. Furthermore, through its decision traces, the ReAct framework also increases the model's interpretability, trustworthiness and diagnosability.

Code Snippet
This demonstrates a ReAct-based agent implementation using LlamaIndex. First, it defines two functions (multiply and add). Second, these two functions are encapsulated as FunctionTool objects, forming the agent's action space, to be executed based on its reasoning.
from llama_index.core.agent import ReActAgent
from llama_index.core.tools import FunctionTool
from llama_index.llms.openai import OpenAI

# create basic function tools
def multiply(a: float, b: float) -> float:
    return a * b

multiply_tool = FunctionTool.from_defaults(fn=multiply)

def add(a: float, b: float) -> float:
    return a + b

add_tool = FunctionTool.from_defaults(fn=add)

# the agent reasons over which tool to call at each step
llm = OpenAI(model="gpt-3.5-turbo")
agent = ReActAgent.from_tools([multiply_tool, add_tool], llm=llm, verbose=True)
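A quick usage sketch (the query is illustrative): the agent interleaves thoughts, tool calls and observations, and verbose=True prints this trace.

response = agent.chat("What is 20 + (2 * 4)? Use the tools to calculate step by step.")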
The advantages of an agentic workflow become more substantial when combined with self-reflection or self-correction. This is a rapidly growing domain, with a variety of agent architectures being explored. For instance, the Reflexion framework facilitates iterative learning by providing a summary of verbal feedback from the environment and storing the feedback in the model's memory; the CRITIC framework empowers frozen LLMs to self-verify through interaction with external tools such as code interpreters and API calls.
Further Reading
5. Fine-Tuning

Fine-tuning is the process of feeding niche and specialized datasets to modify the LLM so that it is more aligned with a certain objective. It differs from prompt engineering and RAG in that it updates the LLM weights and parameters. Full fine-tuning refers to updating all weights of the pretrained LLM through backpropagation, which requires large memory to store all weights and parameters and may suffer from a significant reduction in ability on other tasks (i.e. catastrophic forgetting). Therefore, PEFT (parameter-efficient fine-tuning) is more widely used to mitigate these caveats while saving the time and cost of model training. There are three categories of PEFT methods:
- Selective: Select a subset of the initial LLM parameters to fine-tune, which can be more computationally intensive compared to other PEFT methods.
- Reparameterization: Adjust model weights through training the weights of low-rank representations. For example, Low-Rank Adaptation (LoRA) is in this category and accelerates fine-tuning by representing the weight updates with two smaller matrices (see the sketch after this list).
- Additive: Add additional trainable layers to the model, including techniques like adapters and soft prompts.
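As a minimal LoRA sketch using the peft library (the rank, scaling factor and target modules below are illustrative assumptions that vary by model architecture):

from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model, TaskType

base_model = AutoModelForCausalLM.from_pretrained(model_name)  # model_name is a placeholder

# LoRA represents weight updates as two small low-rank matrices attached to selected layers
lora_config = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    r=8,                  # rank of the update matrices
    lora_alpha=32,        # scaling factor
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],  # illustrative; depends on the model architecture
)

model = get_peft_model(base_model, lora_config)
model.print_trainable_parameters()  # only a small fraction of weights remain trainable

The wrapped model can then be trained with the same Trainer workflow shown below.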
The fine-tuning process is similar to the classic deep learning training process, requiring the following inputs:
- training and evaluation datasets
- training arguments that define the hyperparameters (e.g. learning rate, optimizer)
- a pretrained LLM
- compute metrics and objective functions that the algorithm should be optimized for
Code Snippet
Below is an example of implementing fine-tuning using the transformers Trainer.
from transformers import TrainingArguments, Trainer

# define the training hyperparameters
training_args = TrainingArguments(
    output_dir=output_dir,
    learning_rate=1e-5,
    eval_strategy="epoch",
)

# model, datasets and compute_metrics are assumed to be prepared beforehand
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    compute_metrics=compute_metrics,
)

trainer.train()
Fine-tuning has a wide range of use cases. For instance, instruction fine-tuning optimizes LLMs for conversations and following instructions by training them on prompt-completion pairs. Another example is domain adaptation, an unsupervised fine-tuning method that helps LLMs specialize in specific knowledge domains.
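For illustration, a single instruction fine-tuning record could look like the hypothetical prompt-completion pair below:

# illustrative prompt-completion pair for instruction fine-tuning
example = {
    "prompt": "Summarize the ticket: The app crashes when uploading photos larger than 10 MB.",
    "completion": "User reports app crashes on photo uploads above 10 MB.",
}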
Further Reading
6. RLHF

Reinforcement learning from human feedback (RLHF) is a reinforcement learning technique that fine-tunes LLMs based on human preferences. RLHF operates by training a reward model on human feedback and using this model as a reward function to optimize a reinforcement learning policy through PPO (Proximal Policy Optimization). The process requires two sets of training data: a preference dataset for training the reward model, and a prompt dataset used in the reinforcement learning loop.
Let’s break it down into steps:
- Gather a preference dataset annotated by human labelers who rate different completions generated by the model based on human preference. An example format of a preference record is {input_text, candidate1, candidate2, human_preference}, indicating which candidate response is preferred.
- Train a reward model using the preference dataset. The reward model is essentially a regression model that outputs a scalar indicating the quality of the model-generated response. Its objective is to maximize the score gap between the winning and losing candidates (a minimal sketch follows this list).
- Use the reward model in a reinforcement learning loop to fine-tune the LLM. The objective is to update the policy so that the LLM generates responses that maximize the reward produced by the reward model. This process utilizes the prompt dataset, which is a collection of prompts in the format of {prompt, response, rewards}.
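As a minimal sketch, the reward model can be framed as a sequence classifier with a single scalar output that scores a (prompt, response) pair; the base model below is an illustrative assumption and would still need to be trained on the preference dataset:

import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

reward_name = "distilbert-base-uncased"  # illustrative base model for the reward head
reward_tokenizer = AutoTokenizer.from_pretrained(reward_name)
reward_model = AutoModelForSequenceClassification.from_pretrained(reward_name, num_labels=1)

# score a prompt-response pair with a single scalar reward
inputs = reward_tokenizer("prompt text", "candidate response", return_tensors="pt")
with torch.no_grad():
    reward = reward_model(**inputs).logits[0, 0]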
Code Snippet
The open-source library TRL (Transformer Reinforcement Learning) is widely applied in implementing RLHF, and it provides a template that shows the basic RLHF setup:
- Initialize the base model and tokenizer from a pretrained checkpoint.
- Configure the PPO hyperparameters in PPOConfig, like learning rate, epochs, and batch sizes.
- Create the PPO trainer PPOTrainer by combining the model, tokenizer, and training data.
- Run the training loop, which uses the step() method to iteratively update the model to optimize the rewards calculated from the query and model response.
# trl: Transformer Reinforcement Learning library
from trl import PPOTrainer, PPOConfig, AutoModelForCausalLMWithValueHead
from trl import create_reference_model
from transformers import AutoTokenizer

# define the hyperparameters of the PPO algorithm
# (model_name, learning_rate, batch sizes, dataset and collator are placeholders)
config = PPOConfig(
    model_name=model_name,
    learning_rate=learning_rate,
    ppo_epochs=max_ppo_epochs,
    mini_batch_size=mini_batch_size,
    batch_size=batch_size,
)

# initiate the pretrained model (with a value head) and tokenizer
ppo_model = AutoModelForCausalLMWithValueHead.from_pretrained(config.model_name)
ref_model = create_reference_model(ppo_model)
tokenizer = AutoTokenizer.from_pretrained(config.model_name)

# initiate the PPO trainer with a frozen reference copy of the model
ppo_trainer = PPOTrainer(
    config=config,
    model=ppo_model,
    ref_model=ref_model,
    tokenizer=tokenizer,
    dataset=dataset["train"],
    data_collator=collator,
)

# ppo_trainer is iteratively updated through the rewards
ppo_trainer.step(query_tensors, response_tensors, rewards)
RLHF is widely applied to align model responses with human preferences. Common use cases involve reducing response toxicity and model hallucination. However, it does have the downside of requiring a large amount of human-annotated data as well as the computation costs associated with policy optimization. Therefore, alternatives like Reinforcement Learning from AI Feedback (RLAIF) and Direct Preference Optimization (DPO) have been introduced to mitigate these limitations.
Further Reading
Take-Home Message
This article briefly explains six essential LLM customization strategies: prompt engineering, decoding strategy, RAG, Agent, fine-tuning, and RLHF. I hope you find it helpful for understanding the pros and cons of each strategy as well as how to implement them based on the practical examples.