How to Use the Trainer API in Hugging Face for Custom Training Loops



Image by Editor | Midjourney

 

Let’s learn how to define custom training loops with Hugging Face’s Trainer API.

 

Preparation

 
First, install the packages below for this tutorial:

pip install transformers datasets

 

You also need to install PyTorch. The appropriate package differs based on your environment (CPU-only or a specific CUDA version), so check the official PyTorch installation instructions and make sure you have the proper build installed.
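
In many environments the default package is enough; if you need a specific CUDA (or CPU-only) build, copy the exact command from the official PyTorch installation page instead of the generic one below:

pip install torch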

With all of your libraries installed, let’s get on with it.

 

Custom Training Loops with Trainer API

 
If you have ever fine-tuned a Transformer model the standard way, think about how the process works under the hood and how you could tweak it for your own purposes. When your use case is not straightforward and requires specific behavior, you can build custom training loops with the Trainer API to accomplish it.

We can use the Trainer API as it is, but we can also subclass the Trainer to customize how the training loop behaves.

Let’s start by preparing the standard fine-tuning requirements: the pre-trained model, the tokenizer, and the dataset.

from transformers import BertForSequenceClassification, BertTokenizer
from datasets import load_dataset

model = BertForSequenceClassification.from_pretrained('bert-base-uncased', num_labels=2)
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

dataset = load_dataset('imdb')

 

We will use BERT to train a binary text classification model.

Next, we will preprocess the data. We will also create smaller subsets of the train and test splits, which you can swap into the trainer later if you want to cut down the training time.

def preprocess_function(examples):
    return tokenizer(examples['text'], truncation=True, padding=True)

tokenized_datasets = dataset.map(preprocess_function, batched=True)

small_train_dataset = tokenized_datasets['train'].shuffle(seed=42).select(range(100))
small_eval_dataset = tokenized_datasets['test'].shuffle(seed=42).select(range(50))

 

Now we set up the training arguments. We will use only a single epoch and a larger batch size to keep the training time reasonable.

from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="./results",
    evaluation_strategy="epoch",
    logging_dir="./logs",
    per_device_train_batch_size=32,
    per_device_eval_batch_size=32,
    num_train_epochs=1,
    logging_steps=10,
    save_total_limit=2,
)

 

Now we can develop our custom training loop with the help of Transformers. Here is an example of a custom Trainer that overrides how the optimizer, the scheduler, and the training loop are set up.

from torch.optim import AdamW
from transformers import get_scheduler
from transformers import Trainer

class CustomTrainer(Trainer):
    def create_optimizer_and_scheduler(self, num_training_steps):
        if self.optimizer is None:
            self.optimizer = AdamW(self.model.parameters(), lr=self.args.learning_rate)
        if self.lr_scheduler is None:
            self.lr_scheduler = get_scheduler(
                name="linear",
                optimizer=self.optimizer,
                num_warmup_steps=0,
                num_training_steps=num_training_steps,
            )

    def train(self, resume_from_checkpoint=None, trial=None, ignore_keys_for_eval=None, **kwargs):
        # Initialize the optimizer and the learning rate scheduler
        num_training_steps = int(len(self.get_train_dataloader()) * self.args.num_train_epochs)
        self.create_optimizer_and_scheduler(num_training_steps)

        model = self.model
        model.train()  # put the model into training mode
        for epoch in range(int(self.args.num_train_epochs)):
            print(f"Starting epoch {epoch + 1}")

            for step, batch in enumerate(self.get_train_dataloader()):
                # Move the batch to the same device as the model
                batch = {k: v.to(self.args.device) for k, v in batch.items()}

                outputs = model(**batch)
                loss = outputs.loss
                loss.backward()

                self.optimizer.step()
                self.lr_scheduler.step()
                self.optimizer.zero_grad()

                if step % self.args.logging_steps == 0:
                    print(f"Step {step}: Loss = {loss.item()}")

        print("Training is done")

 

So what happens in the code above? There are a few things that we customize:

  1. We use the AdamW optimizer to update the model weights during training
  2. We set up a linear learning rate scheduler that decays the learning rate over the course of training
  3. We write the training loop ourselves, moving each batch to the model's device and logging the loss at regular steps

This is how we customize our Trainer, and you can tweak it even further if you need something more specific, for example by overriding how the loss is computed, as sketched below.
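
As an illustration only (this class is not part of the tutorial's code, and the class weights are made-up values), here is a minimal sketch of that kind of customization: overriding compute_loss to apply a weighted cross-entropy loss.

import torch
from torch.nn import CrossEntropyLoss
from transformers import Trainer

class WeightedLossTrainer(Trainer):
    def compute_loss(self, model, inputs, return_outputs=False, **kwargs):
        # Separate the labels from the rest of the model inputs
        labels = inputs.pop("labels")
        outputs = model(**inputs)
        logits = outputs.logits
        # Illustrative class weights: penalize errors on class 1 twice as much
        loss_fct = CrossEntropyLoss(weight=torch.tensor([1.0, 2.0], device=logits.device))
        loss = loss_fct(logits.view(-1, model.config.num_labels), labels.view(-1))
        return (loss, outputs) if return_outputs else loss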

Lastly, we train and evaluate the model.

trainer = CustomTrainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_datasets['train'],
    eval_dataset=tokenized_datasets['test'],
)

trainer.train()

evaluation_results = trainer.evaluate()

print(evaluation_results)

 

Output:

{'eval_loss': 0.15452663600444794, 'eval_model_preparation_time': 0.0038, 'eval_runtime': 765.5939, 'eval_samples_per_second': 32.654, 'eval_steps_per_second': 1.021}
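
The evaluation output above only reports the loss and runtime statistics. If you also want a task metric such as accuracy, you can pass a compute_metrics function when constructing the trainer. Here is a minimal sketch (the function and the accuracy metric are my own additions, not part of the tutorial's original code):

import numpy as np

def compute_metrics(eval_pred):
    # eval_pred contains the model's logits and the reference labels
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    return {"accuracy": (predictions == labels).mean()}

# Pass it when building the trainer, e.g. CustomTrainer(..., compute_metrics=compute_metrics),
# and trainer.evaluate() will then also report eval_accuracy.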

 

Master training loop customization to improve your training workflow.

 

Additional Resources

 

 
 

Cornellius Yudha Wijaya is a data science assistant manager and data writer. While working full-time at Allianz Indonesia, he loves to share Python and data tips through social media and writing. Cornellius writes on a variety of AI and machine learning topics.
