
Implementing Multi-Modal RAG Systems
Large language models (LLMs) have evolved and permeated our lives so much, and so quickly, that many of us have come to depend on them in all sorts of scenarios. Once people see how helpful products such as ChatGPT are for text generation, few can avoid relying on them. However, the answers are sometimes inaccurate, which motivates output-enhancement techniques such as retrieval-augmented generation, or RAG.
RAG is a framework that enhances the LLM output by incorporating real-time retrieval of external knowledge. Multi-modal RAG systems take this a step further by enabling the retrieval and processing of information across multiple data formats, such as text and image data.
In this article, we will implement multi-modal RAG using text, audio, and image data.
Multi-Modal RAG System
Multi-modal RAG systems incorporate multiple data types into the knowledge base so the model can produce better output. There are many ways to implement them, but what matters is building a system that works well in production rather than one that is merely fancy.
In this tutorial, we will enhance the RAG system by building a knowledge base with both image and audio data. For the entire code base, you can visit the following GitHub repository.
The workflow can be summarized in the image below.
The workflow consists of seven steps:
- Extract Images
- Embed Images
- Store Image Embeddings
- Process Audio
- Store Audio Embeddings
- Retrieve Data
- Generate and Output Response
As this tutorial requires substantial resources, we will use Google Colab with GPU access. More specifically, we will use an A100 GPU, as the RAM requirements are relatively high.
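If you want to confirm which accelerator your session actually received before running the heavier models, a quick optional check like the sketch below works (assuming PyTorch is already available, as it is in Colab by default):

import torch

# Report the accelerator assigned to this session
if torch.cuda.is_available():
    gpu = torch.cuda.get_device_properties(0)
    print(f"GPU: {gpu.name}, VRAM: {gpu.total_memory / 1024**3:.1f} GB")
else:
    print("No GPU detected; the models in this tutorial will be very slow on CPU.")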
Let’s start by installing all the libraries that are important for our tutorial.
pip install pdf2image Pillow chromadb torch torchvision torchaudio transformers librosa ipython open-clip-torch sentence-transformers qwen_vl_utils
You can visit the PyTorch website to see which build works for your system and environment.
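For instance, on a machine with CUDA 12.1, the GPU build can usually be installed with a command along these lines; treat the index URL as an example and confirm the exact one for your CUDA version on the PyTorch site:

pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121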
Additionally, there are times when image extraction from PDF does not work correctly. If this happens, you should install the following tool.
apt-get update
apt-get install -y poppler-utils
With the environment and the tools ready, we will import all the necessary libraries.
import os

from pdf2image import convert_from_path
from PIL import Image
import chromadb
from chromadb.utils.embedding_functions import OpenCLIPEmbeddingFunction
import torch
from transformers import (
    CLIPProcessor,
    CLIPModel,
    WhisperProcessor,
    WhisperForConditionalGeneration,
    Qwen2VLForConditionalGeneration,
    Qwen2VLProcessor,
)
import librosa
from sentence_transformers import SentenceTransformer
from qwen_vl_utils import process_vision_info
from IPython.display import display, Image as IPImage
In this tutorial, we will use both the image data from the PDFs and the audio files (.mp3) that we prepared previously. We will use the Short Cooking Recipe from Unilever for the PDF file and the Gordon Ramsay Cooking Audio file from YouTube. You can find both files in the dataset folder in the GitHub repository.
Put all the files in the dataset folder, and we are ready to go.
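As a quick sanity check, you can list the folder to confirm that both the PDF and the .mp3 file are in place; a minimal sketch, assuming the folder is named dataset as in the rest of the tutorial:

import os

# Confirm the knowledge-base files are where the pipeline expects them
for name in sorted(os.listdir("dataset")):
    print(name)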
We will start by processing the image data from a PDF file. To do that, we will extract each of the PDF pages as an image with the following code.
output_dir = "dataset"
image_output_dir = "extracted_images"

def convert_pdfs_to_images(folder, image_output_dir):
    if not os.path.exists(image_output_dir):
        os.makedirs(image_output_dir)

    pdf_files = [f for f in os.listdir(folder) if f.endswith('.pdf')]
    all_images = {}

    for doc_id, pdf_file in enumerate(pdf_files):
        pdf_path = os.path.join(folder, pdf_file)
        images = convert_from_path(pdf_path, dpi=100)

        image_paths = []
        for i, image in enumerate(images):
            image_path = os.path.join(image_output_dir, f"{doc_id}_page_{i}.png")
            image.save(image_path, "PNG")
            image_paths.append(image_path)

        all_images[doc_id] = image_paths
    return all_images

all_images = convert_pdfs_to_images(output_dir, image_output_dir)
Once all the images are extracted from the PDF file, we will generate image embeddings with the CLIP model. CLIP is a multi-modal model developed by OpenAI, designed to understand the relationship between image and text data.
In our pipeline, we use CLIP to generate image embeddings that we will store in the ChromaDB vector database later and use to retrieve relevant images based on text queries.
To generate the image embedding, we will use the following code.
device = "cuda" if torch.cuda.is_available() else "cpu"
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32").to(device)
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def embed_images(image_paths):
    embeddings = []
    for path in image_paths:
        image = Image.open(path)
        inputs = processor(images=image, return_tensors="pt", padding=True).to(device)
        with torch.no_grad():
            image_embedding = model.get_image_features(**inputs).cpu().numpy()
        embeddings.append(image_embedding)
    return embeddings

image_embeddings = {}
for doc_id, paths in all_images.items():
    image_embeddings[doc_id] = embed_images(paths)
Next, we will process the audio data and generate text transcriptions using the Whisper model. Whisper is an OpenAI model with a transformer-based architecture that generates text from audio input.
We are not using Whisper for embeddings in our pipeline; it is only responsible for audio transcription. We will transcribe the audio in chunks and then use a sentence transformer to generate embeddings for the transcription chunks.
To process the audio transcription, we will use the following code.
whisper_processor = WhisperProcessor.from_pretrained("openai/whisper-small")
whisper_model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-small").to(device)

# chunk_length defines how many seconds of audio go into each chunk
def transcribe_audio(audio_path, chunk_length=30):
    audio, sr = librosa.load(audio_path, sr=16000)
    chunk_size = chunk_length * sr
    chunks = [audio[i:i + chunk_size] for i in range(0, len(audio), chunk_size)]

    transcription_chunks = []
    for chunk in chunks:
        inputs = whisper_processor(chunk, sampling_rate=sr, return_tensors="pt").to(device)
        inputs["attention_mask"] = torch.ones_like(inputs.input_features)
        with torch.no_grad():
            predicted_ids = whisper_model.generate(**inputs, max_length=448)
        chunk_transcription = whisper_processor.batch_decode(predicted_ids, skip_special_tokens=True)[0]
        transcription_chunks.append(chunk_transcription)

    full_transcription = " ".join(transcription_chunks)
    return full_transcription, transcription_chunks

audio_files = [f for f in os.listdir(output_dir) if f.endswith('.mp3')]
audio_transcriptions = {}
for audio_id, audio_file in enumerate(audio_files):
    audio_path = os.path.join(output_dir, audio_file)
    full_transcription, transcription_chunks = transcribe_audio(audio_path)
    audio_transcriptions[audio_id] = {
        "full_transcription": full_transcription,
        "chunks": transcription_chunks,
    }
With everything in place, we will store our embeddings in the ChromaDB vector database. We keep the image and audio transcription data in separate collections, as they have different embedding characteristics, and we initialize the embedding functions for both.
client = chromadb.PersistentClient(path="chroma_db")
embedding_function = OpenCLIPEmbeddingFunction()
text_embedding_model = SentenceTransformer('sentence-transformers/all-MiniLM-L6-v2')

# Delete existing collections (if needed)
try:
    client.delete_collection(name="image_collection")
    client.delete_collection(name="audio_collection")
    print("Deleted existing collections.")
except Exception as e:
    print(f"Collections do not exist or could not be deleted: {e}")

image_collection = client.create_collection(name="image_collection", embedding_function=embedding_function)
audio_collection = client.create_collection(name="audio_collection")

# Store the CLIP image embeddings with the image path as metadata
for doc_id, embeddings in image_embeddings.items():
    for i, embedding in enumerate(embeddings):
        image_collection.add(
            ids=[f"image_{doc_id}_{i}"],
            embeddings=[embedding.flatten().tolist()],
            metadatas=[{"doc_id": str(doc_id), "image_path": all_images[doc_id][i]}]
        )

# Store the transcription chunks with sentence-transformer embeddings
for audio_id, transcription_data in audio_transcriptions.items():
    transcription_chunks = transcription_data["chunks"]
    for chunk_id, chunk in enumerate(transcription_chunks):
        chunk_embedding = text_embedding_model.encode(chunk)
        audio_collection.add(
            ids=[f"audio_{audio_id}_chunk_{chunk_id}"],
            embeddings=[chunk_embedding.tolist()],
            metadatas=[{
                "audio_id": str(audio_id),
                "audio_path": audio_files[audio_id],
                "chunk_id": str(chunk_id),
            }],
            documents=[chunk]
        )
Our RAG system is almost ready! The only thing left to do is set up the retrieval system from the ChromaDB vector database.
For example, let’s try retrieving the top two results from both an image and an audio file using a text query.
def retrieve_data(query, top_k=2):
    # OpenCLIP embedding for the image collection
    query_embedding_image = embedding_function([query])[0]
    # SentenceTransformer embedding for the audio collection
    query_embedding_audio = text_embedding_model.encode(query)

    image_results = image_collection.query(
        query_embeddings=[query_embedding_image],
        n_results=top_k
    )
    audio_results = audio_collection.query(
        query_embeddings=[query_embedding_audio.tolist()],
        n_results=top_k
    )

    retrieved_images = [
        metadata["image_path"]
        for metadata in image_results["metadatas"][0]
        if "image_path" in metadata
    ]
    retrieved_chunks = audio_results["documents"][0] if "documents" in audio_results else []

    return retrieved_images, retrieved_chunks

query = "What are the healthiest ingredients to use in the recipe you have?"
retrieved_images, retrieved_chunks = retrieve_data(query)
print("Retrieved Images:", retrieved_images)
print("Retrieved Audio Chunks:", retrieved_chunks)
The result for both retrievals is shown in the output below.
Retrieved Images: ['extracted_images/0_page_3.png', 'extracted_images/0_page_12.png']
Retrieved Audio Chunks: [" Lemon. Zest the lemon. Over. Smells incredible. And then finally seal the deal with a touch of grated parmesan cheese. Give your veg some attitude and you'll get amazingly elegant dishes on a budget that are always guaranteed to impress. What more do you want from great cooking? Cheap to make, easy to cook and absolutely stunning. For me, food always has to be impressive. But when it comes to desserts,", " and one third of your protein, chicken. With a dish that takes literally minutes to put together, it's really important to get everything organized. Everything needs to be at your fingertips. Touch of olive oil. Get that pan really nice and ready. Just starting to smoke. Drop the chicken in first. Just salt, pepper. Open up those little strands of chicken."]
For image retrieval, the query returns the image paths we stored as metadata in the vector database. For audio retrieval, it returns the transcription chunks most related to the text query.
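Because display and IPImage were imported earlier, you can also preview the retrieved pages inline in the notebook to verify that sensible pages came back; a minimal sketch:

# Render the retrieved PDF pages inline for a quick visual check
for image_path in retrieved_images:
    display(IPImage(filename=image_path, width=400))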
With retrieval in place, we will set up the generative model using Qwen-VL. Qwen-VL is a multi-modal LLM that can handle both text and image data and generate text responses from them. In our pipeline, it takes the retrieved images and the audio transcription chunks as context to answer the query.
Let’s set up the model with the following code.
vl_model = Qwen2VLForConditionalGeneration.from_pretrained(
    "Qwen/Qwen2-VL-7B-Instruct",
    torch_dtype=torch.bfloat16,
).cuda().eval()

min_pixels = 256 * 256
max_pixels = 1024 * 1024
vl_model_processor = Qwen2VLProcessor.from_pretrained(
    "Qwen/Qwen2-VL-7B-Instruct",
    min_pixels=min_pixels,
    max_pixels=max_pixels
)
Then, we pass the retrieved data and the query to the model, process the inputs, and generate the text output.
chat_template = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": retrieved_images[0]},  # First retrieved image
            {"type": "image", "image": retrieved_images[1]},  # Second retrieved image
            {"type": "text", "text": query},  # User query
            {"type": "text", "text": "Audio Context: " + " ".join(retrieved_chunks)},  # Include audio data
        ],
    }
]

text = vl_model_processor.apply_chat_template(
    chat_template, tokenize=False, add_generation_prompt=True
)

image_inputs, _ = process_vision_info(chat_template)
inputs = vl_model_processor(
    text=[text],
    images=image_inputs,
    padding=True,
    return_tensors="pt",
).to("cuda")

generated_ids = vl_model.generate(**inputs, max_new_tokens=100)
generated_ids_trimmed = [
    out_ids[len(in_ids):] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_text = vl_model_processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)

print(output_text[0])
The result is shown in the output below.
The healthiest ingredients to use in the recipe are:
1. **Lemon** – Provides a burst of citrus flavor and is a good source of vitamin C.
2. **Parmesan Cheese** – A good source of calcium and protein.
3. **Chicken** – A lean protein source that is rich in essential amino acids.
4. **Olive Oil** – A healthy fat that is rich in monounsaturated fatty acids.
5. **Zest** – Adds a burst of flavor
As you can see, the result takes into account both the image and audio data.
That’s all you need to build a multi-modal RAG system. You can change the files and code to suit your needs.
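For example, because the ChromaDB client was created with a persistent path, you can reload the collections in a later session without re-running the embedding steps; a minimal sketch, assuming the same chroma_db directory and collection names:

import chromadb
from chromadb.utils.embedding_functions import OpenCLIPEmbeddingFunction

# Reopen the persisted vector store built earlier
client = chromadb.PersistentClient(path="chroma_db")
image_collection = client.get_collection(name="image_collection", embedding_function=OpenCLIPEmbeddingFunction())
audio_collection = client.get_collection(name="audio_collection")

print(image_collection.count(), "image embeddings,", audio_collection.count(), "audio chunks")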
Conclusion
Retrieval-augmented generation, or RAG, is a framework that enhances LLM output using external knowledge. In multi-modal RAG systems, we utilize data other than simple text, such as image and audio data.
In this article, we have implemented multi-modal RAG using text, audio, and image data. We used CLIP for image embeddings, Whisper for audio transcription, SentenceTransformer for text embeddings, ChromaDB for vector storage, and Qwen-VL for multi-modal text generation.
I hope this has helped!