It involves converting the weights from FP16 to INT8, effectively halving the size of the LLM. The method claims to efficiently reduce the size of LLMs of up to 175B parameters without performance degradation.
Before going to the details of the paper [1], it’s important to understand that LLMs have emergent features — patterns that arise from the training data and are crucial for the model’s performance. Some of these features can have large magnitudes and can exert a strong influence over the model’s overall performance.
Steps involved:
- The LLM.int8() method starts with vector-wise quantization. This means that each vector (a row in the matrix) is quantized separately, using its own normalization constant. The relative significance of each feature is thus preserved.
- For each vector, a normalization constant (the absolute maximum of that vector) is computed and used to scale its values into the 8-bit integer range. Using these per-vector constants, the vast majority of the features in the LLM are quantized.
- For emergent outliers (features with unusually large magnitudes), a mixed-precision decomposition scheme is used. These outlier features are isolated into a separate 16-bit matrix multiplication, ensuring they are handled accurately while still allowing more than 99.9% of the values to be multiplied in 8-bit (see the sketch after this list).
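To make the mechanics concrete, here is a minimal, self-contained sketch of row-wise absmax quantization with the outlier features split out into a 16-bit path. This is illustrative only, not the bitsandbytes implementation; the threshold of 6.0 mirrors the default outlier threshold described in the paper, while the shapes and the helper function are made up for the example.
import torch

def int8_matmul_with_outliers(X, W, threshold=6.0):
    # Columns of X whose magnitude exceeds the threshold are treated as outlier features.
    outlier_cols = (X.abs() > threshold).any(dim=0)

    # Vector-wise absmax quantization of the "regular" part.
    X_reg = X[:, ~outlier_cols]
    W_reg = W[~outlier_cols, :]
    cx = X_reg.abs().amax(dim=1, keepdim=True) / 127.0   # per-row normalization constants
    cw = W_reg.abs().amax(dim=0, keepdim=True) / 127.0   # per-column normalization constants
    X_q = torch.clamp((X_reg / cx).round(), -127, 127)
    W_q = torch.clamp((W_reg / cw).round(), -127, 127)

    # INT8 matmul (emulated here in float), then dequantize with the constants.
    Y_int8 = (X_q @ W_q) * cx * cw

    # Outlier columns stay in 16-bit precision.
    Y_fp16 = X[:, outlier_cols] @ W[outlier_cols, :]
    return Y_int8 + Y_fp16

X = torch.randn(4, 64); W = torch.randn(64, 16)
X[:, 3] *= 20  # inject an "emergent" outlier feature
print(int8_matmul_with_outliers(X, W).shape)  # torch.Size([4, 16])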
Pros
LLMs can be quantized and used immediately for inference without performance degradation.
Cons
The method focuses only on the INT8 datatype and models of up to 175B parameters (especially OPT-175B / BLOOM).
Code Implementation
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

model_id = "meta-llama/Llama-2-7b-chat-hf"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model_8bit = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto", load_in_8bit=True)
GPTQ (Oct 2022)
GPTQ was an early one-shot PTQ technique that enabled efficient deployment of large language models. This is achieved mainly through two techniques proposed in the paper [4]:
- Layerwise Quantization
Quantization is performed layer by layer in the LLM. The goal is to find a simpler (quantized) version of the weights that still gives a good result when used for predictions. This is done so that the difference between the outputs produced with the original weights and with the quantized weights is as small as possible, i.e., the lowest mean squared error, as sketched after this list.
- Optimal Brain Quantization
This is an algorithm intended to reduce the errors introduced into the model by quantization. While one weight is being quantized, the remaining (not yet quantized) weights are adjusted to compensate for the error.
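To see what the layer-wise objective looks like, here is a toy illustration for a single linear layer. It uses naive round-to-nearest as the quantizer and a random calibration batch; the real GPTQ solver quantizes columns one at a time and uses second-order information to compensate the error, so treat this as a sketch of the objective only.
import torch

torch.manual_seed(0)
X = torch.randn(128, 256)          # calibration activations for one layer
W = torch.randn(256, 64)           # original full-precision weights

# Naive 4-bit round-to-nearest as a stand-in quantizer.
scale = W.abs().amax() / 7.0
W_q = torch.clamp((W / scale).round(), -8, 7) * scale

# The layer-wise objective GPTQ minimizes: || X W - X W_q ||^2
err = torch.norm(X @ W - X @ W_q) ** 2
print(f"layer-wise reconstruction error: {err.item():.2f}")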
Pros
GPTQ allows quantization down to 2 bits, providing a range of trade-offs between model size and performance.
Cons
Quantization with this method can introduce considerable performance degradation, especially at very low bit-widths.
Code Implementation
Install the required libraries.
pip install auto-gptq transformers accelerate
Load the model and quantize it with the autogptq library.
from transformers import AutoModelForCausalLM, AutoTokenizer, GPTQConfig
model_id = "meta-llama/Llama-2-7b-hf"
tokenizer = AutoTokenizer.from_pretrained(model_id)
quant_config = GPTQConfig(bits=4, dataset="c4", tokenizer=tokenizer)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto", quantization_config=quant_config)
QLoRA (May 2023)
Before diving into QLoRA, here is a brief introduction to LoRA. LoRA (Low-Rank Adaptation of Large Language Models) is a parameter-efficient fine-tuning method used to specialize LLMs for particular tasks. It achieves this by integrating trainable matrices based on rank decomposition into every transformer layer. Moreover, it minimizes the number of parameters that need to be trained for the targeted task, all the while maintaining the original pre-trained model weights unchanged. Read more about it here.
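For intuition, here is a minimal sketch of the low-rank update LoRA adds to a frozen linear layer. The class name, shapes, and hyperparameters below are illustrative, not the peft implementation.
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, d_in, d_out, r=8, alpha=16):
        super().__init__()
        self.base = nn.Linear(d_in, d_out, bias=False)
        self.base.weight.requires_grad_(False)        # frozen pre-trained weight
        self.A = nn.Parameter(torch.randn(r, d_in) * 0.01)
        self.B = nn.Parameter(torch.zeros(d_out, r))  # B starts at zero, so the update starts at zero
        self.scale = alpha / r

    def forward(self, x):
        # Output = frozen projection + scaled low-rank update B(Ax).
        return self.base(x) + self.scale * (x @ self.A.T @ self.B.T)

layer = LoRALinear(512, 512)
print(layer(torch.randn(2, 512)).shape)  # torch.Size([2, 512]); only A and B are trainable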
QLoRA is an enhanced version of LoRA. Here are the highlights in this method as described in the paper [2]:
1. 4-bit NormalFloat Quantization:
The 4-bit NormalFloat data type is built by computing the 2ᵏ+1 quantiles (where k is the bit count) of a standard normal distribution and normalizing the resulting values to the [-1, 1] interval. Neural network weights, which are approximately normally distributed, are likewise scaled to the [-1, 1] range and then mapped to the nearest of these quantile levels (see the sketch after this list).
2. Double Quantization:
This involves quantizing the quantization constants used in the 4-bit NF quantization itself. Because QLoRA uses block-wise k-bit quantization, one quantization constant is stored per small block of weights; quantizing these constants saves an average of roughly 0.37 bits per parameter (the arithmetic is worked out in the sketch after this list).
3. Paged Optimizers:
QLoRA uses NVIDIA's unified memory feature to page optimizer states between GPU and CPU memory. This prevents out-of-memory errors during GPU memory spikes and lets training proceed without interruption.
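To make points 1 and 2 concrete, here is a small sketch (not the bitsandbytes implementation). The NF4-style levels below are built from evenly spaced quantiles of a standard normal, which is a simplification of the paper's exact construction, and the bits-per-parameter arithmetic assumes the block sizes reported in the paper (blocks of 64 weights with FP32 constants, re-quantized to 8 bits in second-level blocks of 256).
import torch

# 1. NF4-style levels: quantiles of N(0, 1), normalized to [-1, 1].
k = 4
probs = torch.linspace(0.01, 0.99, 2 ** k)            # avoid the infinite 0/1 quantiles
levels = torch.distributions.Normal(0.0, 1.0).icdf(probs)
levels = levels / levels.abs().max()                  # normalize to [-1, 1]

w = torch.randn(64)                                   # one block of weights
w_scaled = w / w.abs().max()                          # scale the block to [-1, 1]
idx = (w_scaled.unsqueeze(1) - levels).abs().argmin(dim=1)  # nearest NF4 level
print("4-bit codes:", idx[:8].tolist())

# 2. Double quantization: storage overhead of the per-block constants.
no_dq = 32 / 64                          # one FP32 constant per 64-weight block
with_dq = 8 / 64 + 32 / (64 * 256)       # 8-bit constants, plus one FP32 constant per 256 blocks
print(f"overhead per parameter: {no_dq:.3f} vs {with_dq:.3f} bits "
      f"(saving ~{no_dq - with_dq:.2f} bits/parameter)")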
Pros
Due to its lower GPU memory usage, QLoRA can support longer maximum sequence lengths and larger batch sizes.
Cons
It can be slower in terms of tuning speed. It also falls slightly behind in cost efficiency, though this is rarely a decisive concern.
Code Implementation
Install the required libraries
pip install -q -U trl transformers accelerate git+https://github.com/huggingface/peft.git
pip install -q datasets bitsandbytes
Load the model and tokenizer. Configure the LoRA parameters.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "meta-llama/Llama-2-7b-chat-hf"
tokenizer = AutoTokenizer.from_pretrained(model_id)
bnb_config = BitsAndBytesConfig(
load_in_4bit=True,
bnb_4bit_quant_type="nf4",
bnb_4bit_compute_dtype=torch.float16,
)
model = AutoModelForCausalLM.from_pretrained(
model_id,
quantization_config=bnb_config,
trust_remote_code=True
)
model.config.use_cache = False
from peft import LoraConfig, get_peft_model
lora_alpha = 16
lora_dropout = 0.1
lora_r = 64
peft_config = LoraConfig(
lora_alpha=lora_alpha,
lora_dropout=lora_dropout,
r=lora_r,
task_type="CAUSAL_LM"
)
Set up the trainer using SFTTrainer from the TRL library, which provides a wrapper around the transformers Trainer to easily fine-tune models on instruction-based datasets using PEFT adapters. Of course, you will need a dataset to train on.
from transformers import TrainingArguments

output_dir = "./models"
per_device_train_batch_size = 4
gradient_accumulation_steps = 4
optim = "paged_adamw_32bit"
save_steps = 100
logging_steps = 10
learning_rate = 2e-4
max_grad_norm = 0.3
max_steps = 100
warmup_ratio = 0.03
lr_scheduler_type = "constant"
training_arguments = TrainingArguments(
output_dir=output_dir,
per_device_train_batch_size=per_device_train_batch_size,
gradient_accumulation_steps=gradient_accumulation_steps,
optim=optim,
save_steps=save_steps,
logging_steps=logging_steps,
learning_rate=learning_rate,
fp16=True,
max_grad_norm=max_grad_norm,
max_steps=max_steps,
warmup_ratio=warmup_ratio,
group_by_length=True,
lr_scheduler_type=lr_scheduler_type,
)
from trl import SFTTrainer
max_seq_length = 512
trainer = SFTTrainer(
model=model,
train_dataset=dataset,
peft_config=peft_config,
dataset_text_field="text",
max_seq_length=max_seq_length,
tokenizer=tokenizer,
args=training_arguments,
)
trainer.train()
AWQ (Jun 2023)
AWQ (Activation-Aware Weight Quantization) is a post-training quantization (PTQ) method. Instead of looking at the weights alone, it uses the model's activations to decide which weights matter most. Let me quote directly from the paper [3]:
Our method is based on the observation that weights are not equally important: protecting only 1% of salient weights can greatly reduce quantization error. We then propose to search for the optimal per-channel scaling that protects the salient weights by observing the activation, not weights.
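Here is a toy illustration of the motivating observation in that quote: keeping the weights of the most activation-salient input channels in higher precision sharply reduces the output error. The real AWQ replaces this mixed-precision trick with a searched per-channel scaling; the shapes, bit-width, and channel count below are arbitrary.
import torch

torch.manual_seed(0)
X = torch.randn(256, 128)
X[:, :2] *= 30                              # two "salient" input channels with large activations
W = torch.randn(128, 64)

def rtn_quant(w, bits=3):
    # Simple per-tensor round-to-nearest, as a stand-in quantizer.
    qmax = 2 ** (bits - 1) - 1
    scale = w.abs().amax() / qmax
    return torch.clamp((w / scale).round(), -qmax - 1, qmax) * scale

# Quantize everything vs. keep the salient input channels' weights in full precision.
salient = X.abs().mean(dim=0).topk(2).indices
W_all = rtn_quant(W)
W_mixed = rtn_quant(W)
W_mixed[salient, :] = W[salient, :]

print("quantize all channels:    ", torch.norm(X @ W - X @ W_all).item())
print("protect salient channels: ", torch.norm(X @ W - X @ W_mixed).item())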
Pros
AWQ provides more accuracy than other methods as weights critical to the LLM performance are preserved. It is also efficient and faster as it does not involve backpropagation or reconstruction. It performs well on edge devices.
Cons
While maintaining 0.1% of weights in FP16 can enhance the performance of quantization without significantly increasing the model size, this mixed-precision data type complicates system implementation.
Code Implementation
Install required libraries.
!pip install autoawq transformers accelerate
Load the model and quantize it with the autoawq library.
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer
model_id = 'meta-llama/Llama-2-7b-hf'
quant_path = 'Llama2-7b-awq-4bit'
quant_config = { "zero_point": True, "q_group_size": 128, "w_bit": 4 }# Load model and tokenizer
model = AutoAWQForCausalLM.from_pretrained(model_id)
tokenizer = AutoTokenizer.from_pretrained(model_id, use_fast=True)
# Quantize
model.quantize(tokenizer, quant_config=quant_config)
QuIP# (Jul 2023)
In simple terms, QuIP (Quantization with Incoherence Processing) is based on the idea that the process of quantization can be improved if the weights of the model are evenly distributed (incoherent), and the important directions for rounding them are not aligned with the coordinate axes. It consists of two steps:
- LDLQ Adaptive rounding procedure: Adjust the weights of the model in a way that minimizes a certain measure of error (the ‘quadratic proxy objective’) [8].
- Pre- and post-processing: Multiply the weight and Hessian matrices by random orthogonal matrices. This ensures that the weights and Hessians are incoherent, which is beneficial for the quantization process.
QuIP# [5] advances on QuIP using some improvements in processing.
- Improved Incoherence Processing: It replaces the random orthogonal matrices with a faster and more effective construction, the randomized Hadamard transform (sketched after this list).
- Vector Quantization: QuIP# uses vector quantization to leverage the ball-shaped sub-Gaussian distribution that incoherent weights possess. Specifically, it introduces a set of hardware-efficient codebooks based on the highly symmetric E8 lattice. The E8 lattice achieves the optimal 8-dimension unit ball packing, which means it can represent the weights more efficiently.
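Below is a minimal sketch of incoherence processing with a randomized Hadamard transform. It is illustrative only; the real pipeline applies such transforms to both the weights and the Hessian and undoes them after quantization, and the matrix size and outlier value here are arbitrary.
import torch
from scipy.linalg import hadamard

n = 256                                   # power of two, required by this Hadamard construction
W = torch.randn(n, n)
W[0, 0] = 50.0                            # an outlier weight

# Randomized Hadamard transform: orthogonal Hadamard matrix times random signs.
H = torch.tensor(hadamard(n), dtype=torch.float32) / n ** 0.5
S = torch.diag(torch.randint(0, 2, (n,)).float() * 2 - 1)
U = H @ S                                 # orthogonal: U @ U.T = I

W_inc = U @ W @ U.T                       # "incoherent" weights: the outlier is spread out
print("max |w| before:", W.abs().max().item())
print("max |w| after: ", W_inc.abs().max().item())

# The transform is exactly invertible, so nothing is lost before quantization.
print("round-trip error:", torch.norm(U.T @ W_inc @ U - W).item())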
Pros
Compared to other methods, QuIP# offers significantly higher throughput (more than 40% higher) at the same or better quantization quality, which is impressive for 2-bit quantization.
Cons
Although few limitations are explicitly mentioned, implementation complexity and hardware compatibility can be concerns.
Code Implementation
Clone the official repo and install the required libraries.
git clone https://github.com/Cornell-RelaxML/quip-sharp.git
pip install -r requirements.txt
cd quiptools && python setup.py install && cd ../
Find the scripts for various models in the repo. Run the script quantize_finetune_llama.py to quantize Llama models.
Also, check out the repo for QuIP quantization. The code for quantizing models is shown below.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from quantizer import QuipQuantizer

model_name = "meta-llama/Llama-2-70b-hf"
quant_dir = "llama-70b_2bit_quip"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.float16)
quant = QuipQuantizer(codebook="E8P12", dataset="redpajama")
quant.quantize_model(model, tokenizer, quant_dir)
GGUF (Aug 2023)
GGUF (GPT-Generated Unified Format) was a highly anticipated release by Georgi Gerganov and the llama.cpp team. Its main highlight is that LLMs can now be run easily on consumer CPUs. The format was earlier called GGML and was later superseded by GGUF.
A notable achievement of GGML was the ability to offload some layers of the LLM to the GPU, if one is available, even while the rest of the model runs on the CPU. This effectively addresses the common challenge developers face with insufficient VRAM.
Pros
If you plan to run LLMs on a CPU or on Apple devices (the M-series chips), it is the go-to method for many LLMs like Llama and Mistral. The GGUF file format is now well supported by llama.cpp and Hugging Face. GGUF models are also reported to achieve low perplexity scores relative to other quantization formats at comparable sizes.
Cons
GGUF is focused on CPU and Apple M series devices. This could be a limitation if you’re working with different hardware configurations.
Code Implementation
Install the ctransformers library.
pip install ctransformers[cuda]
Models are available in the repositories by TheBloke on Hugging Face.
from ctransformers import AutoModelForCausalLM
from transformers import AutoTokenizer, pipeline

# Load LLM and Tokenizer
# Use `gpu_layers` to specify how many layers will be offloaded to the GPU.
model = AutoModelForCausalLM.from_pretrained(
"TheBloke/zephyr-7B-beta-GGUF",
model_file="zephyr-7b-beta.Q4_K_M.gguf",
model_type="mistral", gpu_layers=50, hf=True
)
tokenizer = AutoTokenizer.from_pretrained(
"HuggingFaceH4/zephyr-7b-beta", use_fast=True
)
# Create a pipeline
pipe = pipeline(model=model, tokenizer=tokenizer, task='text-generation')
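With the pipeline in place, generation works like any other transformers pipeline; the prompt and generation settings below are arbitrary examples.
prompt = "Explain quantization in one sentence."
outputs = pipe(prompt, max_new_tokens=64, do_sample=True, temperature=0.7)
print(outputs[0]["generated_text"])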
HQQ (Nov 2023)
According to the paper, quantization approaches fall into data-free calibration techniques (such as bitsandbytes) and calibration-based techniques (such as GPTQ and AWQ). Calibration-free methods are faster, while calibration-based methods suffer from data bias and long quantization times.
HQQ (Half-Quadratic Quantization) carries out quantization on the fly using fast and robust optimization. It eliminates the need for calibration data and is versatile enough to quantize any given model, achieving the speed of calibration-free methods while avoiding data bias. It reduces quantization time to just a few minutes thanks to optimization techniques such as half-quadratic splitting (sketched below). For more details on the math and workings of the method, see the official website.
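To give a flavor of the optimization, here is a loose sketch of half-quadratic splitting applied to tuning a per-group zero-point. This is not the official hqq implementation: the paper uses an lp-norm shrinkage with p < 1, whereas a plain L1 soft-threshold stands in below, and the function name, iteration count, and beta schedule are simplified assumptions.
import torch

def hqq_like_zero_point(W, bits=4, iters=20, beta=1.0, kappa=1.01):
    # Per-row scale and zero-point, as in typical weight-only quantization.
    qmax = 2 ** bits - 1
    w_min, w_max = W.min(dim=1, keepdim=True).values, W.max(dim=1, keepdim=True).values
    s = (w_max - w_min) / qmax
    z = -w_min / s

    for _ in range(iters):
        W_q = torch.clamp((W / s + z).round(), 0, qmax)
        W_dq = (W_q - z) * s
        # Sub-problem 1: sparse error term via soft-thresholding (the paper uses an lp shrinkage).
        E = torch.sign(W - W_dq) * torch.clamp((W - W_dq).abs() - 1.0 / beta, min=0.0)
        # Sub-problem 2: closed-form update of the zero-point given the error estimate.
        z = (W_q - (W - E) / s).mean(dim=1, keepdim=True)
        beta *= kappa

    W_q = torch.clamp((W / s + z).round(), 0, qmax)
    return W_q, s, z

W = torch.randn(64, 128)
W_q, s, z = hqq_like_zero_point(W)
print("reconstruction error:", torch.norm(W - (W_q - z) * s).item())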
Pros
HQQ achieves a surprisingly low quantization time compared to other methods (up to 50x faster than GPTQ). Eliminating the need for calibration data also makes it much easier to apply.
Cons
Few limitations are documented. Like other methods, it may still show some quality degradation.
Code Implementation
Install the transformers library and use HQQ implementation straightaway!
import torch
from transformers import AutoModelForCausalLM, HqqConfig

# All linear layers will use the same quantization config
quant_config = HqqConfig(nbits=4, group_size=64, quant_zero=False, quant_scale=False, axis=1)
model_id = "meta-llama/Llama-2-7b-hf"
# Load and quantize
model = AutoModelForCausalLM.from_pretrained(
model_id,
torch_dtype=torch.float16,
device_map="cuda",
quantization_config=quant_config
)
AQLM (Feb 2024)
AQLM (Additive Quantization of Language Models) is a weight-only PTQ method that sets a new benchmark in the 2-bit-per-parameter range. It outperforms popular algorithms like GPTQ as well as QuIP and QuIP#.
It applies a technique called Multi-Codebook Quantization (MCQ), which divides each vector into sub-vectors and approximates them using a finite set of codewords. Codewords are learned vectors stored in a codebook [7]. AQLM works by taking the rows of the weight matrices in a model and quantizing them additively with several such codebooks, as sketched below.
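As a rough illustration of the multi-codebook idea, here is a greedy residual variant of codebook quantization. AQLM itself learns the codebooks on data and selects codewords with beam search, so the random, unlearned codebooks below give a large error; the sketch only shows the mechanics and the bits-per-weight accounting, and all sizes are arbitrary.
import torch

torch.manual_seed(0)
group_size, n_codebooks, codebook_size = 8, 2, 256

w = torch.randn(1024)                          # one row of a weight matrix
groups = w.view(-1, group_size)                # split into 8-dimensional sub-vectors
codebooks = [torch.randn(codebook_size, group_size) for _ in range(n_codebooks)]

# Greedily pick, for each codebook in turn, the codeword closest to the remaining residual.
residual = groups.clone()
approx = torch.zeros_like(groups)
for C in codebooks:
    dists = torch.cdist(residual, C)           # [n_groups, codebook_size]
    idx = dists.argmin(dim=1)                  # chosen codeword per sub-vector
    approx += C[idx]
    residual = groups - approx

bits_per_weight = n_codebooks * 8 / group_size  # 8-bit indices per codebook, shared over 8 weights
print(f"~{bits_per_weight:.0f} bits/weight, relative error "
      f"{(torch.norm(groups - approx) / torch.norm(groups)).item():.3f}")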
Pros
AQLM offers rapid implementations for token generation on both GPU and CPU, allowing it to surpass the speed of optimized FP16 implementations, all while operating within a significantly reduced memory footprint.
Cons
Few limitations are mentioned elsewhere. Like other methods, it may still show some quality degradation.
Code Implementation
The instructions on how to quantize models yourself and the corresponding code can be found in the official repo. To run AQLM models, load a model that has been quantized with AQLM:
from transformers import AutoTokenizer, AutoModelForCausalLM

quantized_model = AutoModelForCausalLM.from_pretrained(
"ISTA-DASLab/Mixtral-8x7b-AQLM-2Bit-1x16-hf",
torch_dtype="auto",
device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained("ISTA-DASLab/Mixtral-8x7b-AQLM-2Bit-1x16-hf")
Quantization methods have opened up a world of possibilities, enabling advanced language processing capabilities even on the devices in our pockets. In this article, we discussed LLM quantization and explored various methods to quantize LLMs in detail. We also went through the pros and cons of each approach and learned how to use them, along with how to select the most suitable approach based on specific requirements and whether you are running on a CPU or a GPU.