Machine learning models have made significant progress in the past year, and artificial intelligence now seems capable of handling almost any task. Speaking of models that can do nearly anything, this is where multimodal models come in.
Multimodal models are not your typical machine learning models, as they are designed to understand multiple data types (modalities), such as text, images, audio, and video. They can also generate different types of data. Application examples include text-to-speech models, visual question answering models, image captioning models, and more.
In this article, we will implement multimodal models with Hugging Face Transformers. The open-source company hosts many pre-trained models that we can use, including multimodal models.
How can we do that? Let’s get into it.
Preparation
Let’s start with installing the required packages for this tutorial.
pip install transformers datasets pillow requests
You also need to install the PyTorch package, as it is an integral part of this tutorial.
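If PyTorch is not installed yet, a plain pip install usually works, although the exact command may differ depending on your CUDA setup (check the official PyTorch installation instructions for your platform). The text-to-speech example later in this article also uses the soundfile package, so you can install both at once.
pip install torch soundfile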
Once everything is installed perfectly, we will jump into the main article.
Implement the Multimodal Visual Question Answering (VQA) Model
A visual question answering model can answer textual questions about an image. It receives both image and text input and produces text output.
This article uses a pre-trained multimodal model called BLIP (Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation). It is a base model for the VQA use case and can be fine-tuned for downstream tasks.
Let's try the base model on an image example. For the sample, we will use a cat image from the COCO dataset. First, let's import the necessary packages and set up the image loader for the model.
import torch
from transformers import BlipProcessor, BlipForQuestionAnswering
from PIL import Image
import requests
def load_image(image_url):
    # Download the image and convert it to RGB
    return Image.open(requests.get(image_url, stream=True).raw).convert('RGB')
The image we download is converted into an RGB image.
Next, we load the BLIP model and the processor that prepares the image and text data. Hugging Face Transformers already provides all the necessary pipelines; we only need to use them.
processor = BlipProcessor.from_pretrained("Salesforce/blip-vqa-base")
model = BlipForQuestionAnswering.from_pretrained("Salesforce/blip-vqa-base")
With the model and processor ready, we can try answering questions about the image.
image_url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = load_image(image_url)
# Example question
question = "How many cats are in the image?"
# Preparing input and generating answer output
inputs = processor(image, question, return_tensors="pt")
out = model.generate(**inputs)
answer = processor.decode(out[0], skip_special_tokens=True)
print(f"Question: question")
print(f"Answer: answer")
Output>>
Question: How many cats are in the image?
Answer: 2
As you can see, the model answers the question nicely. Let's try another example question.
question = "What are the objects beside the cats?"
inputs = processor(image, question, return_tensors="pt")
out = model.generate(**inputs)
answer = processor.decode(out[0], skip_special_tokens=True)
print(f"\nQuestion: question")
print(f"Answer: answer")
Output>>
Question: What are the objects beside the cats?
Answer: remotes
The model answered the text question we gave it perfectly. This is an example of a multimodal model that takes text and image input and produces text output.
We can also fine-tune the VQA model for more specific use cases. For example, suppose we have a radiology image dataset with health-specific questions; we can fine-tune the BLIP model on this dataset.
Let’s load the dataset to start the fine-tuning process.
import torch
from transformers import BlipProcessor, BlipForQuestionAnswering, TrainingArguments, Trainer
from datasets import load_dataset
from PIL import Image
import requests
import io
dataset = load_dataset("flaviagiammarino/vqa-rad")
dataset
Output>>
DatasetDict({
    train: Dataset({
        features: ['image', 'question', 'answer'],
        num_rows: 1793
    })
    test: Dataset({
        features: ['image', 'question', 'answer'],
        num_rows: 451
    })
})
We will preprocess the image and text data from the dataset above using the BLIP processor.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
def preprocess_function(examples):
    # Prepare image-question pairs as inputs and tokenized answers as labels
    questions = [q.strip() for q in examples["question"]]
    images = examples["image"]
    inputs = processor(images=images, text=questions, padding="max_length", truncation=True, return_tensors="pt")
    targets = processor(text=examples["answer"], padding="max_length", truncation=True, return_tensors="pt")
    inputs['labels'] = targets['input_ids']
    return {k: v.to(device) for k, v in inputs.items()}
processed_dataset = dataset.map(
    preprocess_function,
    batched=True,
    remove_columns=dataset["train"].column_names,
    num_proc=4 if device.type == "cpu" else 1,
)
Once the dataset is ready, we will start the fine-tuning process.
training_args = TrainingArguments(
    output_dir="./vqa_blip_rad_finetuned",
    learning_rate=5e-5,
    per_device_train_batch_size=8 if device.type == "cuda" else 4,
    per_device_eval_batch_size=8 if device.type == "cuda" else 4,
    num_train_epochs=3,
    weight_decay=0.01,
    evaluation_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True,
    fp16=device.type == "cuda",
)
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=processed_dataset["train"],
    eval_dataset=processed_dataset["test"],
    tokenizer=processor,
)
trainer.train()
Output>>
Epoch   Training Loss   Validation Loss
1       No log          0.020441
2       No log          0.017374
3       1.105900        0.016882
TrainOutput(global_step=675, training_loss=0.8215750219203808, metrics={'train_runtime': 2545.0967, 'train_samples_per_second': 2.113, 'train_steps_per_second': 0.265, 'total_flos': 5568706140991488.0, 'train_loss': 0.8215750219203808, 'epoch': 3.0})
With the fine-tuning process finished, we will test the model to see how it performs. We will use a sample from the test data that is already available.
test_example = dataset['test'][2]
test_image = test_example['image']
test_question = test_example['question']
inputs = processor(test_image, test_question, return_tensors="pt").to(device)
with torch.no_grad():
    out = model.generate(**inputs)
answer = processor.decode(out[0], skip_special_tokens=True)
print(f"Question: {test_question}")
print(f"Answer: {answer}")
print(f"Ground Truth: {test_example['answer']}")
Output>>
Question: is there any intraparenchymal abnormalities in the lung fields?
Answer: no
Ground Truth: no
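If you want a rough quantitative check rather than a single sample, a minimal sketch like the one below (my own addition, not part of the original tutorial) loops over a small subset of the test split and counts exact matches between the generated answers and the ground truth.
# Rough exact-match accuracy on a small subset of the test split
correct = 0
n_samples = 50  # keep it small for speed
for example in dataset["test"].select(range(n_samples)):
    inputs = processor(example["image"], example["question"], return_tensors="pt").to(device)
    with torch.no_grad():
        out = model.generate(**inputs)
    prediction = processor.decode(out[0], skip_special_tokens=True).strip().lower()
    if prediction == example["answer"].strip().lower():
        correct += 1
print(f"Exact-match accuracy on {n_samples} samples: {correct / n_samples:.2%}")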
If you find the model satisfactory, you can save the model and processor.
model.save_pretrained("./vqa_blip_rad_finetuned")
processor.save_pretrained("./vqa_blip_rad_finetuned")
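To use the fine-tuned model later, you can reload both the model and the processor from the saved directory. A minimal sketch, assuming the same directory path as above:
# Reload the fine-tuned model and processor from the saved directory
model = BlipForQuestionAnswering.from_pretrained("./vqa_blip_rad_finetuned").to(device)
processor = BlipProcessor.from_pretrained("./vqa_blip_rad_finetuned")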
That’s all for the VQA multimodal model; let’s try another multimodal example with text-to-speech.
Implement the Multimodal Text-to-Speech (TTS) Model
A text-to-speech model is a multimodal model that receives text input and produces audio output. It is used in many business applications, such as virtual avatars, virtual assistants, and content creation.
In this tutorial, we will use Microsoft's SpeechT5 multimodal text-to-speech model, which has been pre-trained for speech synthesis. We will also use the SpeechT5 HiFi-GAN vocoder to convert the intermediate speech representations generated by the text-to-speech model into audio waveforms.
Let’s try it out with the following code.
import torch
from transformers import SpeechT5Processor, SpeechT5ForTextToSpeech, SpeechT5HifiGan
from datasets import load_dataset
import soundfile as sf
processor = SpeechT5Processor.from_pretrained("microsoft/speecht5_tts")
model = SpeechT5ForTextToSpeech.from_pretrained("microsoft/speecht5_tts")
vocoder = SpeechT5HifiGan.from_pretrained("microsoft/speecht5_hifigan")
Once we have the model and the processor, we can try to generate an audio file from text. We can also specify the voice we want if we have speaker embedding data. In this case, we will use the speaker embeddings from the cmu-arctic-xvectors dataset.
# Load xvector containing speaker's voice characteristics
embeddings_dataset = load_dataset("Matthijs/cmu-arctic-xvectors", split="validation")
speaker_embeddings = torch.tensor(embeddings_dataset[7306]["xvector"]).unsqueeze(0)
text = "Hello, we are trying out the text to speech model with Hugging Face Transformers."
inputs = processor(text=text, return_tensors="pt")
speech = model.generate_speech(inputs["input_ids"], speaker_embeddings, vocoder=vocoder)
sf.write("output_speech.wav", speech.numpy(), samplerate=16000)
You should now have the audio file saved in your directory. Listen to it and see whether you like the result. Try changing the speaker embedding to hear the different voices you can use, as in the sketch below.
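For example, this minimal sketch (my addition; the index is chosen arbitrarily) picks a different x-vector from the same dataset and writes a second audio file:
# Pick another x-vector from the same dataset (index chosen arbitrarily)
other_speaker_embeddings = torch.tensor(embeddings_dataset[42]["xvector"]).unsqueeze(0)
speech = model.generate_speech(inputs["input_ids"], other_speaker_embeddings, vocoder=vocoder)
sf.write("output_speech_other_voice.wav", speech.numpy(), samplerate=16000)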
Conclusion
Multimodal models are an exciting field, as they open up many applications that we can use for business. Hugging Face Transformers provides access to many multimodal models that we can implement and fine-tune for downstream tasks. This article discussed two multimodal model applications: Visual Question Answering (VQA) and Text-to-Speech (TTS).
Cornellius Yudha Wijaya is a data science assistant manager and data writer. While working full-time at Allianz Indonesia, he loves to share Python and data tips via social media and writing media. Cornellius writes on a variety of AI and machine learning topics.