
Implementing Multi-Modal RAG Systems
Large language models (LLMs) have evolved and permeated our lives so much, and so quickly, that many of us have come to depend on them in all sorts of scenarios. Once people see how helpful products such as ChatGPT are for text generation, few can avoid relying on them. However, the answers are sometimes inaccurate, which motivates output-enhancement techniques such as retrieval-augmented generation, or RAG.
RAG is a framework that enhances the LLM output by incorporating real-time retrieval of external knowledge. Multi-modal RAG systems take this a step further by enabling the retrieval and processing of information across multiple data formats, such as text and image data.
In this article, we will implement multi-modal RAG using text, audio, and image data.
Multi-Modal RAG System
Multi-modal RAG systems incorporate multiple data types into the knowledge base so the model can produce better output. There are many ways to implement them, but what matters is building a system that works well in production rather than one that is merely fancy.
In this tutorial, we will enhance the RAG system by building a knowledge base with both image and audio data. For the entire code base, you can visit the following GitHub repository.
The workflow can be summarized in the image below.
The workflow consists of seven steps:
- Extract Images
- Embed Images
- Store Image Embeddings
- Process Audio
- Store Audio Embeddings
- Retrieve Data
- Generate and Output Response
As this tutorial requires substantial resources, we will use Google Colab with GPU access. More specifically, we will use an A100 GPU, as the RAM requirements are relatively high.
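If you want to confirm which accelerator your session actually received before running the heavier models, a quick optional check like the sketch below works (assuming PyTorch is already available, as it is in Colab by default):

import torch

# Report the accelerator assigned to this session
if torch.cuda.is_available():
    gpu = torch.cuda.get_device_properties(0)
    print(f"GPU: {gpu.name}, VRAM: {gpu.total_memory / 1024**3:.1f} GB")
else:
    print("No GPU detected; the models in this tutorial will be very slow on CPU.")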
Let’s start by installing all the libraries that are important for our tutorial.
pip install pdf2image Pillow chromadb torch torchvision torchaudio transformers librosa ipython open-clip-torch sentence-transformers qwen_vl_utils
You can visit the PyTorch website to see which build works for your system and environment.
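For instance, on a machine with CUDA 12.1, the GPU build can usually be installed with a command along these lines; treat the index URL as an example and confirm the exact one for your CUDA version on the PyTorch site:

pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121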
Additionally, there are times when image extraction from PDF does not work correctly. If this happens, you should install the following tool.
apt-get update
apt-get install -y poppler-utils
With the environment and the tools ready, we will import all the necessary libraries.
import os

from pdf2image import convert_from_path
from PIL import Image
import chromadb
from chromadb.utils.embedding_functions import OpenCLIPEmbeddingFunction
import torch
from transformers import (
    CLIPProcessor,
    CLIPModel,
    WhisperProcessor,
    WhisperForConditionalGeneration,
    Qwen2VLForConditionalGeneration,
    Qwen2VLProcessor,
)
import librosa
from sentence_transformers import SentenceTransformer
from qwen_vl_utils import process_vision_info
from IPython.display import display, Image as IPImage
In this tutorial, we will use both the image data from the PDFs and the audio files (.mp3) that we prepared previously. We will use the Short Cooking Recipe from Unilever for the PDF file and the Gordon Ramsay Cooking Audio file from YouTube. You can find both files in the dataset folder in the GitHub repository.
Put all the files in the dataset folder, and we are ready to go.
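As a quick sanity check, you can list the folder to confirm that both the PDF and the .mp3 file are in place; a minimal sketch, assuming the folder is named dataset as in the rest of the tutorial:

import os

# Confirm the knowledge-base files are where the pipeline expects them
for name in sorted(os.listdir("dataset")):
    print(name)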
We will start by processing the image data from a PDF file. To do that, we will extract each of the PDF pages as an image with the following code.
output_dir = "dataset"
image_output_dir = "extracted_images"

def convert_pdfs_to_images(folder, image_output_dir):
    if not os.path.exists(image_output_dir):
        os.makedirs(image_output_dir)

    pdf_files = [f for f in os.listdir(folder) if f.endswith('.pdf')]
    all_images = {}

    for doc_id, pdf_file in enumerate(pdf_files):
        pdf_path = os.path.join(folder, pdf_file)
        images = convert_from_path(pdf_path, dpi=100)

        image_paths = []
        for i, image in enumerate(images):
            image_path = os.path.join(image_output_dir, f"{doc_id}_page_{i}.png")
            image.save(image_path, "PNG")
            image_paths.append(image_path)

        all_images[doc_id] = image_paths
    return all_images

all_images = convert_pdfs_to_images(output_dir, image_output_dir)
Once all the images are extracted from the PDF file, we will generate image embeddings with the CLIP model. CLIP is a multi-modal model developed by OpenAI, designed to understand the relationship between image and text data.
In our pipeline, we use CLIP to generate image embeddings that we will store in the ChromaDB vector database later and use to retrieve relevant images based on text queries.
To generate the image embedding, we will use the following code.
device = "cuda" if torch.cuda.is_available() else "cpu"
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32").to(device)
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def embed_images(image_paths):
    embeddings = []
    for path in image_paths:
        image = Image.open(path)
        inputs = processor(images=image, return_tensors="pt", padding=True).to(device)
        with torch.no_grad():
            image_embedding = model.get_image_features(**inputs).cpu().numpy()
        embeddings.append(image_embedding)
    return embeddings

image_embeddings = {}
for doc_id, paths in all_images.items():
    image_embeddings[doc_id] = embed_images(paths)
Next, we will process the audio data and generate text transcriptions using the Whisper model. Whisper is an OpenAI model with a transformer-based architecture that generates text from audio input.
We are not using Whisper for embeddings in our pipeline; it is only responsible for audio transcription. We will transcribe the audio in chunks and then use a sentence transformer to generate embeddings for the transcription chunks.
To process the audio transcription, we will use the following code.
whisper_processor = WhisperProcessor.from_pretrained("openai/whisper-small")
whisper_model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-small").to(device)

# chunk_length defines how many seconds of audio go into each chunk
def transcribe_audio(audio_path, chunk_length=30):
    audio, sr = librosa.load(audio_path, sr=16000)
    chunk_size = chunk_length * sr
    chunks = [audio[i:i + chunk_size] for i in range(0, len(audio), chunk_size)]

    transcription_chunks = []
    for chunk in chunks:
        inputs = whisper_processor(chunk, sampling_rate=sr, return_tensors="pt").to(device)
        inputs["attention_mask"] = torch.ones_like(inputs.input_features)
        with torch.no_grad():
            predicted_ids = whisper_model.generate(**inputs, max_length=448)
        chunk_transcription = whisper_processor.batch_decode(predicted_ids, skip_special_tokens=True)[0]
        transcription_chunks.append(chunk_transcription)

    full_transcription = " ".join(transcription_chunks)
    return full_transcription, transcription_chunks

audio_files = [f for f in os.listdir(output_dir) if f.endswith('.mp3')]
audio_transcriptions = {}
for audio_id, audio_file in enumerate(audio_files):
    audio_path = os.path.join(output_dir, audio_file)
    full_transcription, transcription_chunks = transcribe_audio(audio_path)
    audio_transcriptions[audio_id] = {
        "full_transcription": full_transcription,
        "chunks": transcription_chunks,
    }
With everything in place, we will store our embeddings in the ChromaDB vector database. We keep the image and audio transcription data in separate collections, as they have different embedding characteristics, and we initialize the embedding functions for both.
client = chromadb.PersistentClient(path="chroma_db")
embedding_function = OpenCLIPEmbeddingFunction()
text_embedding_model = SentenceTransformer('sentence-transformers/all-MiniLM-L6-v2')

# Delete existing collections (if needed)
try:
    client.delete_collection(name="image_collection")
    client.delete_collection(name="audio_collection")
    print("Deleted existing collections.")
except Exception as e:
    print(f"Collections do not exist or could not be deleted: {e}")

image_collection = client.create_collection(name="image_collection", embedding_function=embedding_function)
audio_collection = client.create_collection(name="audio_collection")

# Store the CLIP image embeddings with the image path as metadata
for doc_id, embeddings in image_embeddings.items():
    for i, embedding in enumerate(embeddings):
        image_collection.add(
            ids=[f"image_{doc_id}_{i}"],
            embeddings=[embedding.flatten().tolist()],
            metadatas=[{"doc_id": str(doc_id), "image_path": all_images[doc_id][i]}]
        )

# Store the transcription chunks with sentence-transformer embeddings
for audio_id, transcription_data in audio_transcriptions.items():
    transcription_chunks = transcription_data["chunks"]
    for chunk_id, chunk in enumerate(transcription_chunks):
        chunk_embedding = text_embedding_model.encode(chunk)
        audio_collection.add(
            ids=[f"audio_{audio_id}_chunk_{chunk_id}"],
            embeddings=[chunk_embedding.tolist()],
            metadatas=[{
                "audio_id": str(audio_id),
                "audio_path": audio_files[audio_id],
                "chunk_id": str(chunk_id),
            }],
            documents=[chunk]
        )
Our RAG system is almost ready! The only thing left to do is set up the retrieval system from the ChromaDB vector database.
For example, let’s try retrieving the top two results from both an image and an audio file using a text query.
def retrieve_data(query, top_k=2):
    # OpenCLIP embedding for the image collection
    query_embedding_image = embedding_function([query])[0]
    # SentenceTransformer embedding for the audio collection
    query_embedding_audio = text_embedding_model.encode(query)

    image_results = image_collection.query(
        query_embeddings=[query_embedding_image],
        n_results=top_k
    )
    audio_results = audio_collection.query(
        query_embeddings=[query_embedding_audio.tolist()],
        n_results=top_k
    )

    retrieved_images = [
        metadata["image_path"]
        for metadata in image_results["metadatas"][0]
        if "image_path" in metadata
    ]
    retrieved_chunks = audio_results["documents"][0] if "documents" in audio_results else []

    return retrieved_images, retrieved_chunks

query = "What are the healthiest ingredients to use in the recipe you have?"
retrieved_images, retrieved_chunks = retrieve_data(query)
print("Retrieved Images:", retrieved_images)
print("Retrieved Audio Chunks:", retrieved_chunks)
The result for both retrievals is shown in the output below.
Retrieved Images: ['extracted_images/0_page_3.png', 'extracted_images/0_page_12.png']
Retrieved Audio Chunks: [" Lemon. Zest the lemon. Over. Smells incredible. And then finally seal the deal with a touch of grated parmesan cheese. Give your veg some attitude and you'll get amazingly elegant dishes on a budget that are always guaranteed to impress. What more do you want from great cooking? Cheap to make, easy to cook and absolutely stunning. For me, food always has to be impressive. But when it comes to desserts,", " and one third of your protein, chicken. With a dish that takes literally minutes to put together, it's really important to get everything organized. Everything needs to be at your fingertips. Touch of olive oil. Get that pan really nice and ready. Just starting to smoke. Drop the chicken in first. Just salt, pepper. Open up those little strands of chicken."]
For image retrieval, the query returns the image paths we stored as metadata in the vector database. For audio retrieval, it returns the transcription chunks most related to the text query.
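Because display and IPImage were imported earlier, you can also preview the retrieved pages inline in the notebook to verify that sensible pages came back; a minimal sketch:

# Render the retrieved PDF pages inline for a quick visual check
for image_path in retrieved_images:
    display(IPImage(filename=image_path, width=400))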
With retrieval in place, we will set up the generative model using Qwen-VL. Qwen-VL is a multi-modal LLM that can handle both text and image data and generate text responses from them. In our pipeline, it takes the retrieved images and the audio transcription chunks as context to answer the query.
Let’s set up the model with the following code.
vl_model = Qwen2VLForConditionalGeneration.from_pretrained(
    "Qwen/Qwen2-VL-7B-Instruct",
    torch_dtype=torch.bfloat16,
).cuda().eval()

min_pixels = 256 * 256
max_pixels = 1024 * 1024
vl_model_processor = Qwen2VLProcessor.from_pretrained(
    "Qwen/Qwen2-VL-7B-Instruct",
    min_pixels=min_pixels,
    max_pixels=max_pixels
)
Then, we pass the retrieved data and the query to the model, process the inputs, and generate the text output.
chat_template = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": retrieved_images[0]},  # First retrieved image
            {"type": "image", "image": retrieved_images[1]},  # Second retrieved image
            {"type": "text", "text": query},  # User query
            {"type": "text", "text": "Audio Context: " + " ".join(retrieved_chunks)},  # Include audio data
        ],
    }
]

text = vl_model_processor.apply_chat_template(
    chat_template, tokenize=False, add_generation_prompt=True
)

image_inputs, _ = process_vision_info(chat_template)
inputs = vl_model_processor(
    text=[text],
    images=image_inputs,
    padding=True,
    return_tensors="pt",
).to("cuda")

generated_ids = vl_model.generate(**inputs, max_new_tokens=100)
generated_ids_trimmed = [
    out_ids[len(in_ids):] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_text = vl_model_processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)

print(output_text[0])
The result is shown in the output below.
The healthiest ingredients to use in the recipe are:
1. **Lemon** – Provides a burst of citrus flavor and is a good source of vitamin C.
2. **Parmesan Cheese** – A good source of calcium and protein.
3. **Chicken** – A lean protein source that is rich in essential amino acids.
4. **Olive Oil** – A healthy fat that is rich in monounsaturated fatty acids.
5. **Zest** – Adds a burst of flavor
As you can see, the result takes into account both the image and audio data.
That’s all you need to build a multi-modal RAG system. You can change the files and code to suit your needs.
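For example, because the ChromaDB client was created with a persistent path, you can reload the collections in a later session without re-running the embedding steps; a minimal sketch, assuming the same chroma_db directory and collection names:

import chromadb
from chromadb.utils.embedding_functions import OpenCLIPEmbeddingFunction

# Reopen the persisted vector store built earlier
client = chromadb.PersistentClient(path="chroma_db")
image_collection = client.get_collection(name="image_collection", embedding_function=OpenCLIPEmbeddingFunction())
audio_collection = client.get_collection(name="audio_collection")

print(image_collection.count(), "image embeddings,", audio_collection.count(), "audio chunks")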
Conclusion
Retrieval-augmented generation, or RAG, is a framework that enhances LLM output using external knowledge. In multi-modal RAG systems, we utilize data other than simple text, such as image and audio data.
In this article, we have implemented multi-modal RAG using text, audio, and image data. We used CLIP for image embeddings, Whisper for audio transcription, SentenceTransformer for text embeddings, ChromaDB for vector storage, and Qwen-VL for multi-modal text generation.
I hope this has helped!