Fine-tuning LLMs with 32-bit, 8-bit, and Paged AdamW Optimizers


Finding the right trade-off between memory efficiency, accuracy, and speed


Fine-tuning large language models (LLMs) has become an essential yet resource-intensive task, demanding considerable GPU memory — especially when using the AdamW optimizer, which can quickly consume available resources. For each model parameter, AdamW requires the storage of two additional optimizer states in memory, each typically in float32 format. This translates to an extra 8 bytes per parameter, meaning that for a model with 8 billion parameters, such as Llama 3.1, roughly 64 GB of memory goes solely toward managing optimizer states.
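To see where that 64 GB figure comes from, here is a quick back-of-the-envelope calculation using only the numbers stated above (two float32 states per parameter, 8 billion parameters):

```python
# Rough estimate of AdamW optimizer-state memory, using the figures above:
# two float32 states (the first and second moments) per model parameter.
num_params = 8e9        # ~8 billion parameters (e.g., Llama 3.1 8B)
states_per_param = 2    # AdamW keeps exponential moving averages m and v
bytes_per_state = 4     # float32

state_memory_gb = num_params * states_per_param * bytes_per_state / 1e9
print(f"Optimizer states: {state_memory_gb:.0f} GB")  # -> 64 GB
```

Note that this counts only the optimizer states; the model weights, gradients, and activations come on top of it.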

The use of quantized and paged optimizers can significantly reduce memory overhead. Libraries like bitsandbytes have facilitated these memory-efficient approaches, making them increasingly popular.
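As a minimal sketch, this is how the three variants compared in this article can be instantiated for a PyTorch model with bitsandbytes. The tiny linear layer and the learning rate are placeholders, not the actual fine-tuning setup:

```python
import torch
import bitsandbytes as bnb

# Toy stand-in for an LLM; in practice this would be a transformer model.
model = torch.nn.Linear(4096, 4096)

# Standard 32-bit AdamW (PyTorch) as the baseline.
adamw_32bit = torch.optim.AdamW(model.parameters(), lr=1e-5)

# 8-bit AdamW: optimizer states are stored in quantized 8-bit blocks.
adamw_8bit = bnb.optim.AdamW8bit(model.parameters(), lr=1e-5)

# Paged 8-bit AdamW: states can be paged out to CPU RAM when GPU memory spikes.
paged_adamw_8bit = bnb.optim.PagedAdamW8bit(model.parameters(), lr=1e-5)
```

When fine-tuning with Hugging Face's Trainer, the same variants can usually be selected by passing an `optim` string such as "adamw_bnb_8bit" or "paged_adamw_8bit" to TrainingArguments, rather than constructing the optimizer by hand.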

In this article, we will present a comparative analysis of 32-bit AdamW, its 8-bit counterpart, and the paged AdamW optimizers, examining their impact on memory consumption, learning curves, and training time. Our goal is to identify when memory-efficient optimizers are essential and to evaluate their trade-offs in training speed and model accuracy. In the first section, we will review 8-bit AdamW and its paged variant. Then, we will benchmark…
