Fine-tuning LLMs with 32-bit, 8-bit, and Paged AdamW Optimizers


Finding the right trade-off between memory efficiency, accuracy, and speed


Fine-tuning large language models (LLMs) has become an essential yet resource-intensive task, demanding considerable GPU memory — especially when using the AdamW optimizer, which can quickly consume available resources. For each model parameter, AdamW requires the storage of two additional optimizer states in memory, each typically in float32 format. This translates to an extra 8 bytes per parameter, meaning that for a model with 8 billion parameters, such as Llama 3.1, roughly 64 GB of memory goes solely toward managing optimizer states.
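To see where that 64 GB figure comes from, here is a quick back-of-the-envelope calculation using only the numbers stated above (two float32 states per parameter, 8 billion parameters):

```python
# Rough estimate of AdamW optimizer-state memory, using the figures above:
# two float32 states (the first and second moments) per model parameter.
num_params = 8e9        # ~8 billion parameters (e.g., Llama 3.1 8B)
states_per_param = 2    # AdamW keeps exponential moving averages m and v
bytes_per_state = 4     # float32

state_memory_gb = num_params * states_per_param * bytes_per_state / 1e9
print(f"Optimizer states: {state_memory_gb:.0f} GB")  # -> 64 GB
```

Note that this counts only the optimizer states; the model weights, gradients, and activations come on top of it.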

The use of quantized and paged optimizers can significantly reduce memory overhead. Libraries like bitsandbytes have facilitated these memory-efficient approaches, making them increasingly popular.
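As a minimal sketch, this is how the three variants compared in this article can be instantiated for a PyTorch model with bitsandbytes. The tiny linear layer and the learning rate are placeholders, not the actual fine-tuning setup:

```python
import torch
import bitsandbytes as bnb

# Toy stand-in for an LLM; in practice this would be a transformer model.
model = torch.nn.Linear(4096, 4096)

# Standard 32-bit AdamW (PyTorch) as the baseline.
adamw_32bit = torch.optim.AdamW(model.parameters(), lr=1e-5)

# 8-bit AdamW: optimizer states are stored in quantized 8-bit blocks.
adamw_8bit = bnb.optim.AdamW8bit(model.parameters(), lr=1e-5)

# Paged 8-bit AdamW: states can be paged out to CPU RAM when GPU memory spikes.
paged_adamw_8bit = bnb.optim.PagedAdamW8bit(model.parameters(), lr=1e-5)
```

When fine-tuning with Hugging Face's Trainer, the same variants can usually be selected by passing an `optim` string such as "adamw_bnb_8bit" or "paged_adamw_8bit" to TrainingArguments, rather than constructing the optimizer by hand.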

In this article, we will present a comparative analysis of 32-bit AdamW, its 8-bit counterpart, and the paged AdamW optimizers, examining their impact on memory consumption, learning curves, and training time. Our goal is to identify when memory-efficient optimizers are essential and to evaluate their trade-offs in training speed and model accuracy. In the first section, we will review 8-bit AdamW and its paged variant. Then, we will benchmark…
