Implementing Multimodal Models with Hugging Face Transformers



 

Machine learning models have made tremendous progress in the past year, and artificial intelligence now seems capable of tackling an ever-wider range of tasks. Speaking of models that can do many things, this is where multimodal models come in.

Multimodal models are not your typical machine learning models, as they are designed to understand multiple data types (modalities), such as text, images, audio, and video. They can also generate different types of data. Application examples include text-to-speech models, visual question answering models, image captioning models, and more.

In this article, we will implement multimodal models with Hugging Face Transformers. The Hugging Face Hub hosts many pre-trained models we can use, including multimodal ones.

How can we do that? Let’s get into it.

 

Preparation

 

Let’s start with installing the required packages for this tutorial.

pip install transformers datasets pillow requests

 

You will also want to install the PyTorch package, as it is an integral part of this tutorial.
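
If PyTorch is not installed yet, a typical installation looks like the command below; check the official PyTorch website for the exact command matching your platform and CUDA version.

pip install torch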

Once everything is installed, we can jump into the main part of the article.

 

Implement the Multimodal Visual Question Answering (VQA) Model

 

A visual question answering model answers textual questions about an image. It receives both image and text as input and produces text as output.

This article will use a pre-trained multimodal model called BLIP (Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation). It is a base model for the VQA use case that can be fine-tuned for downstream tasks.

Let’s try the base model on an example image. For the sample, we will use a cat image from the COCO dataset. First, let’s import the necessary packages and set up an image loader for the model.

import torch
from transformers import BlipProcessor, BlipForQuestionAnswering
from PIL import Image
import requests


def load_image(image_url):
    # Download the image from the URL and convert it to RGB
    return Image.open(requests.get(image_url, stream=True).raw).convert('RGB')

 

The downloaded image is converted into an RGB image.

Next, we load the BLIP model and its processor, which prepares the image and text data. Hugging Face Transformers already provides all the necessary classes, so we only need to use them.

processor = BlipProcessor.from_pretrained("Salesforce/blip-vqa-base")
model = BlipForQuestionAnswering.from_pretrained("Salesforce/blip-vqa-base")

 

With the model and processor ready, we can try answering questions about the image.

image_url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = load_image(image_url)

# Example question
question = "How many cats are in the image?"

# Preparing input and generating answer output
inputs = processor(image, question, return_tensors="pt")
out = model.generate(**inputs)
answer = processor.decode(out[0], skip_special_tokens=True)

print(f"Question: question")
print(f"Answer: answer")

 

Output>>
Question: How many cats are in the image?
Answer: 2

 

As you can see, the model answers the question nicely. Let’s try another example question.

question = "What are the object beside the cats?"

inputs = processor(image, question, return_tensors="pt")
out = model.generate(**inputs)
answer = processor.decode(out[0], skip_special_tokens=True)

print(f"\nQuestion: question")
print(f"Answer: answer")

 

Output>>
Question: What are the objects beside the cats?
Answer: remotes

 

The model answers the text question we give it perfectly. This is an example of a multimodal model that takes text and image as input and produces text as output.
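
If you want to ask several questions about the same image, you can wrap the processing and generation steps into a small helper function. Below is a minimal sketch that reuses the processor and model loaded above; the example questions are only illustrations.

def answer_question(image, question):
    # Prepare the image-question pair and generate a short textual answer
    inputs = processor(image, question, return_tensors="pt")
    out = model.generate(**inputs)
    return processor.decode(out[0], skip_special_tokens=True)

for q in ["What color is the couch?", "Are the cats sleeping?"]:
    print(q, "->", answer_question(image, q))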

We can also fine-tune the VQA model for more specific use cases. For example, suppose we have a radiology image dataset with health-specific questions; we can fine-tune the BLIP model on this dataset.

Let’s load the dataset to start the fine-tuning process.

import torch
from transformers import BlipProcessor, BlipForQuestionAnswering, TrainingArguments, Trainer
from datasets import load_dataset
from PIL import Image
import requests
import io

dataset = load_dataset("flaviagiammarino/vqa-rad")
dataset

 

Output>>
DatasetDict({
    train: Dataset({
        features: ['image', 'question', 'answer'],
        num_rows: 1793
    })
    test: Dataset({
        features: ['image', 'question', 'answer'],
        num_rows: 451
    })
})
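
The dataset contains train and test splits with image, question, and answer columns. Before preprocessing, it can be helpful to peek at a single example; the image column decodes to a PIL image, so the snippet below works as a quick sanity check.

sample = dataset["train"][0]
print(sample["question"])
print(sample["answer"])
print(sample["image"].size)  # PIL image width and height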

 

We will preprocess the image and text data from the dataset above using the BLIP processor.

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")


def preprocess_function(examples):
    # Clean up the questions and pair them with their images
    questions = [q.strip() for q in examples["question"]]
    images = examples["image"]
    inputs = processor(images=images, text=questions, padding="max_length", truncation=True, return_tensors="pt")

    # Tokenize the answers and use them as labels
    targets = processor(text=examples["answer"], padding="max_length", truncation=True, return_tensors="pt")
    inputs['labels'] = targets['input_ids']

    return {k: v.to(device) for k, v in inputs.items()}


processed_dataset = dataset.map(
    preprocess_function,
    batched=True,
    remove_columns=dataset["train"].column_names,
    num_proc=4 if device.type == "cpu" else 1,
)

 

Once the dataset is ready, we will start the fine-tuning process.

training_args = TrainingArguments(
    output_dir="./vqa_blip_rad_finetuned",
    learning_rate=5e-5,
    per_device_train_batch_size=8 if device.type == "cuda" else 4,  
    per_device_eval_batch_size=8 if device.type == "cuda" else 4,
    num_train_epochs=3,
    weight_decay=0.01,
    evaluation_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True,
    fp16=device.type == "cuda",
)


trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=processed_dataset["train"],
    eval_dataset=processed_dataset["test"],
    tokenizer=processor,
)


trainer.train()

 

Output>>


Epoch    Training Loss    Validation Loss
1        No log           0.020441
2        No log           0.017374
3        1.105900         0.016882

TrainOutput(global_step=675, training_loss=0.8215750219203808, metrics={'train_runtime': 2545.0967, 'train_samples_per_second': 2.113, 'train_steps_per_second': 0.265, 'total_flos': 5568706140991488.0, 'train_loss': 0.8215750219203808, 'epoch': 3.0})

 

With the fine-tuning process complete, we can test the model to see its capability. We will use a sample from the test split that is already available.

test_example = dataset['test'][2]
test_image = test_example['image']
test_question = test_example['question']


inputs = processor(test_image, test_question, return_tensors="pt").to(device)
with torch.no_grad():
    out = model.generate(**inputs)
answer = processor.decode(out[0], skip_special_tokens=True)


print(f"Question: test_question")
print(f"Answer: answer")
print(f"Ground Truth: test_example['answer']")

 

 

Output>>
Question: is there any intraparenchymal abnormalities in the lung fields?
Answer: no
Ground Truth: no

If you are satisfied with the model, you can save both the model and the processor.

model.save_pretrained("./vqa_blip_rad_finetuned")
processor.save_pretrained("./vqa_blip_rad_finetuned")
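
Later, you can reload the fine-tuned model and processor from that local directory, for example:

model = BlipForQuestionAnswering.from_pretrained("./vqa_blip_rad_finetuned")
processor = BlipProcessor.from_pretrained("./vqa_blip_rad_finetuned")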

 

That’s all for the VQA multimodal model; let’s try another multimodal example with text-to-speech.
 

Implement the Multimodal Text-to-Speech (TTS) Model

 
The text-to-speech model is a multimodal model that receives text input and produces audio output. It is used in many business applications, such as AI avatars, virtual assistants, and content creation.

In this tutorial, we will use Microsoft’s SpeechT5 multimodal text-to-speech model, which has been pre-trained for speech synthesis. We will also use the SpeechT5 HiFi-GAN vocoder to convert the intermediate speech representations generated by the text-to-speech model into audio waveforms.

Let’s try it out with the following code.

import torch
from transformers import SpeechT5Processor, SpeechT5ForTextToSpeech, SpeechT5HifiGan
from datasets import load_dataset
import soundfile as sf

processor = SpeechT5Processor.from_pretrained("microsoft/speecht5_tts")
model = SpeechT5ForTextToSpeech.from_pretrained("microsoft/speecht5_tts")
vocoder = SpeechT5HifiGan.from_pretrained("microsoft/speecht5_hifigan")

 

Once we have the model and the processor, we can try to generate an audio file from text. It is also possible to specify the voice we want if we have speaker embedding data. In this case, we will use speaker embeddings from the cmu-arctic-xvectors dataset.

# Load xvector containing speaker's voice characteristics
embeddings_dataset = load_dataset("Matthijs/cmu-arctic-xvectors", split="validation")
speaker_embeddings = torch.tensor(embeddings_dataset[7306]["xvector"]).unsqueeze(0)

text = "Hello, we are trying out the text to speech model with Hugging Face Transformers."
inputs = processor(text=text, return_tensors="pt")

speech = model.generate_speech(inputs["input_ids"], speaker_embeddings, vocoder=vocoder)

sf.write("output_speech.wav", speech.numpy(), samplerate=16000)

 

You should now have the audio file saved in your working directory. Listen to it and see how it sounds. You can also change the speaker embedding to hear a different voice, as sketched below.
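
For example, selecting a different row from the embedding dataset changes the speaker characteristics. The index 5000 below is an arbitrary choice for illustration; any valid row in the validation split works.

# Pick another speaker embedding (index chosen arbitrarily for illustration)
other_speaker = torch.tensor(embeddings_dataset[5000]["xvector"]).unsqueeze(0)

speech = model.generate_speech(inputs["input_ids"], other_speaker, vocoder=vocoder)
sf.write("output_speech_other_voice.wav", speech.numpy(), samplerate=16000)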

 

Conclusion

 

Multimodal models are an exciting field, as they open up many applications we can use for business. Hugging Face Transformers provides access to many multimodal models that we can implement and fine-tune for downstream tasks. This article discussed two multimodal applications: Visual Question Answering (VQA) and Text-to-Speech (TTS).

 
 

Cornellius Yudha Wijaya is a data science assistant manager and data writer. While working full-time at Allianz Indonesia, he loves to share Python and data tips via social media and writing media. Cornellius writes on a variety of AI and machine learning topics.
