How to Build a Text Classification Model with Hugging Face Transformers



 

The well-known Hugging Face Transformers library allows users to leverage pre-trained language models and fine-tune them on their own data, addressing specific use cases without needing to train one of these highly sophisticated models from scratch.

Now, can this library be used to build and train your own model for a specific task such as text classification? The answer is yes, and it takes fewer lines of code than you might think. Let’s see how!

 

Building a Text Classification Model in Five Steps

 
Building a transformer-based text classification model using Hugging Face Transformers boils down to five steps, described below.

Prerequisite: install the Hugging Face Transformers and Datasets libraries.

!pip install transformers datasets

 

1. Load the Training Data

 
The following code loads training and test sets from the imdb dataset for movie review classification, a common text classification scenario. Note that the examples below take only 1% of the default training and test partitions, to keep training fast for illustrative purposes (training a transformer-based model usually takes hours!). In a more serious, application-oriented scenario, you’ll want to use much more data so that the trained model learns to do its job well.

from datasets import load_dataset
training_data = load_dataset('imdb', split="train[:1%]")
test_data = load_dataset('imdb', split="test[:1%]")
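
Each record in these splits has a text field containing the review and a label field (0 for negative, 1 for positive). Since taking the first 1% of a split can yield a skewed label distribution if the underlying data happens to be ordered, a quick, optional sanity check is worthwhile:

from collections import Counter

# Optional sanity check: split size, example structure, and label balance
print(training_data)                     # number of rows and column names ('text', 'label')
print(training_data[0]['text'][:200])    # first 200 characters of one review
print(Counter(training_data['label']))   # label counts (0 = negative, 1 = positive)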

 

 

2. Tokenize the Data

 
The next step is tokenizing the data, that is, converting the texts into token-based numerical representations that the language model can process and understand. Tokens are the natural language “units” into which each text input is decomposed, typically words, subword pieces, and punctuation marks. The AutoTokenizer class helps simplify the process:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('distilbert-base-uncased')
def tokenize_function(example):
    return tokenizer(example['text'], padding='max_length', truncation=True)

tokenized_training_data = training_data.map(tokenize_function, batched=True)
tokenized_test_data = test_data.map(tokenize_function, batched=True)
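
If you are curious about what the tokenizer produces, calling it on a single sentence (the example text below is purely illustrative) returns the numerical input_ids together with an attention_mask, which are the fields the model actually consumes:

# Inspect the tokenizer output for one example sentence
print(tokenizer("This movie was surprisingly good!"))
# -> a dictionary-like object with 'input_ids' and 'attention_mask'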

 

3. Load and Initialize Model Architecture

 
Next, we load and initialize our model. Hugging Face Transformers provides ready-made specifications for different transformer architectures adapted to different tasks, which saves us the huge burden of building the entire architecture by hand. The DistilBERT models are an example of comparatively lightweight models well suited to binary text classification, e.g. classifying movie reviews as positive or negative.

from transformers import DistilBertForSequenceClassification

# Two-class DistilBERT classifier: pretrained encoder weights plus a randomly initialized classification head
model = DistilBertForSequenceClassification.from_pretrained(
    'distilbert-base-uncased', num_labels=2
)
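
Note that from_pretrained loads the pretrained DistilBERT encoder weights, and only the classification head starts from random values. If you want the entire network to start from random weights instead, one option is to build the model from a configuration object rather than from a pretrained checkpoint. A minimal sketch, assuming the default DistilBERT architecture settings:

from transformers import DistilBertConfig, DistilBertForSequenceClassification

# Randomly initialized DistilBERT classifier: no pretrained weights anywhere,
# default architecture hyperparameters, two output classes
config = DistilBertConfig(num_labels=2)
scratch_model = DistilBertForSequenceClassification(config)

Keep in mind that a fully randomly initialized model needs far more data and training time than the 1% subset used here to reach useful performance.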

 

4. Train Your Model

 
Training a transformer-based model with Hugging Face is similar to fine-tuning a pre-trained one. It requires instances of the TrainingArguments and Trainer classes (explained in this post): the training arguments are passed to the Trainer, and calling its train() method runs the training loop, which may take more or less time depending on the size of the training data, the model, and other settings such as the batch size.

from transformers import Trainer, TrainingArguments

training_args = TrainingArguments(
    output_dir="./results",
    num_train_epochs=1,     # Small number of epochs
    per_device_train_batch_size=8,
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_training_data,
)

trainer.train()
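
Once training completes, you will usually want to save the trained weights and the tokenizer so they can be reloaded later without retraining. A minimal sketch, with an illustrative output directory:

# Save the trained model and tokenizer to a local directory (the path is illustrative)
trainer.save_model("./my_text_classifier")
tokenizer.save_pretrained("./my_text_classifier")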

 

5. Evaluate Your Model

 
After training the model, we normally evaluate it on the test data. The trainer.evaluate() method is the simplest way to do this: it returns the loss (and, if configured, other metrics), helping assess the model’s performance on unseen data.

trainer.evaluate(tokenized_test_data)

 

An example output might look like this:

{'eval_loss': 0.0030956582631915808,
 'eval_runtime': 216.8128,
 'eval_samples_per_second': 1.153,
 'eval_steps_per_second': 0.148,
 'epoch': 1.0}
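
By default, evaluate() mainly reports the loss and runtime statistics, because we did not tell the Trainer how to compute any task-specific metrics. If you also want accuracy, one common option, sketched below, is to pass a compute_metrics function when building the Trainer:

import numpy as np
from transformers import Trainer

# Turn the model's logits into class predictions and compare them to the true labels
def compute_metrics(eval_pred):
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    return {"accuracy": (predictions == labels).mean()}

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_training_data,
    eval_dataset=tokenized_test_data,
    compute_metrics=compute_metrics,
)

With this in place, trainer.evaluate() reports an eval_accuracy entry alongside the loss.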

 

Remember, if you used a small portion of the data to quickly train the model, don’t expect great evaluation outcomes. Training an effective transformer-based language model takes time, even for simpler language tasks like text classification!
 
 

Iván Palomares Carrascosa is a leader, writer, speaker, and adviser in AI, machine learning, deep learning & LLMs. He trains and guides others in harnessing AI in the real world.
