Large language models (LLMs) have changed the way many people work. Given only a simple prompt, a model can generate complex text, and the technology has become standard in many applications, such as chatbots and planning assistants.
However, LLMs can hallucinate, producing output that is wrong or not grounded in factual information. That is why a technique called retrieval-augmented generation (RAG) was developed to improve LLM output.
RAG is a technique that combines retrieval-based methods with an LLM to improve its responses. By fetching the relevant text or documents from an external knowledge base, the LLM can ground its generated answer in the retrieved data.
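Conceptually, the flow is retrieve-then-generate. The short sketch below illustrates the idea; retriever and llm are hypothetical placeholders for whatever components you plug in, and the rest of this article replaces them with real models:

def rag_answer(query, retriever, llm, k=2):
    # 1. Fetch the k most relevant documents from the knowledge base
    documents = retriever.search(query, k=k)
    # 2. Pass the retrieved context together with the question to the LLM
    prompt = "Context:\n" + "\n".join(documents) + f"\n\nQuestion: {query}"
    # 3. The LLM grounds its answer in the retrieved documents
    return llm.generate(prompt)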
Classically, RAG retrieves and generates only text data. However, a few models have now been developed to support multimodal input.
This article will explore how to build a multimodal RAG implementation with Hugging Face, specifically for visual and text data.
Let’s get into it.
Multimodal RAG Implementation
In this tutorial, we will use Google Colab with access to a GPU. More specifically, we will use an A100 GPU, as the memory requirements for this tutorial are quite high.
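Before going further, you may want to confirm that the runtime actually has a GPU attached. A quick optional check with PyTorch looks like this:

import torch

# Verify that a CUDA device is visible before loading the large models
print(torch.cuda.is_available())
if torch.cuda.is_available():
    print(torch.cuda.get_device_name(0))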
Let’s start by installing the necessary Python packages. Run the following code for the installation.
!pip install byaldi pdf2image qwen-vl-utils transformers
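One note on dependencies: pdf2image relies on the Poppler system library to render PDF pages. If the conversion step later fails with a Poppler-related error, installing it on Colab should resolve it:

!apt-get install -y poppler-utils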
Once the installation finishes, we will build our knowledge base. For this example, we will use a collection of PDF design guides for buildings.
import requests
import os

# PDF design guides that will form our knowledge base
pdfs = {
    "Window": "https://www.westoxon.gov.uk/media/ksqgvl4b/10-design-guide-windows-and-doors.pdf",
    "Roofs": "https://www.westoxon.gov.uk/media/d3ohnpd1/9-design-guide-roofs-and-roofing-materials.pdf",
    "Extensions": "https://www.westoxon.gov.uk/media/pekfogvr/14-design-guide-extensions-and-alterations.pdf",
    "Greener": "https://www.westoxon.gov.uk/media/thplpsay/16-design-guide-greener-traditional-buildings.pdf",
    "Sustainable": "https://www.westoxon.gov.uk/media/nk5bvv0v/12-design-guide-sustainable-building-design.pdf"
}

output_dir = "dataset"
os.makedirs(output_dir, exist_ok=True)

# Download each guide into the dataset folder
for name, url in pdfs.items():
    response = requests.get(url)
    pdf_path = os.path.join(output_dir, f"{name}.pdf")
    with open(pdf_path, "wb") as f:
        f.write(response.content)
After we download all the files, we convert every PDF page into an image. Our multimodal document-retrieval model works on page images, so each document needs to be represented as a set of images.
import os
from pdf2image import convert_from_path

def convert_pdfs_to_images(folder):
    pdf_files = [f for f in os.listdir(folder) if f.endswith('.pdf')]
    all_images = {}
    for doc_id, pdf_file in enumerate(pdf_files):
        pdf_path = os.path.join(folder, pdf_file)
        # Render every page of the PDF as a PIL image
        images = convert_from_path(pdf_path, dpi=100)
        all_images[doc_id] = images
    return all_images

all_images = convert_pdfs_to_images("/content/dataset/")
Every document is now stored as a list of page images, so we can inspect its content visually.
import matplotlib.pyplot as plt

# Show the first eight pages of the first document
fig, axes = plt.subplots(2, 4, figsize=(15, 10))
for i, ax in enumerate(axes.flat):
    img = all_images[0][i]
    ax.imshow(img)
    ax.axis('off')
plt.tight_layout()
plt.show()
Next, we will initialize the RAG system with Byaldi and the document-retrieval model ColPali. ColPali is a retrieval model that fetches documents by working on the page images directly instead of running them through a text-chunking pipeline.
We will use the Byaldi package, a lightweight wrapper around ColPali, to simplify the RAG implementation. Let's use the code below for that.
from byaldi import RAGMultiModalModel
colpali_model = RAGMultiModalModel.from_pretrained("vidore/colpali-v1.2")
Once the model has been downloaded, we will use the following code to index our image data and build the knowledge base.
colpali_model.index(
    input_path="dataset/",
    index_name="image_index",
    store_collection_with_index=False,
    overwrite=True
)
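Indexing can take a while, so it is useful to know that Byaldi saves the index to disk (in a .byaldi/ folder by default). In a later session, you should be able to reload it instead of re-indexing everything, roughly like this:

# Reload a previously built index instead of indexing the PDFs again
colpali_model = RAGMultiModalModel.from_index("image_index")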
With the retrieval model ready, let's try out how well it retrieves documents for a text query.
query = "How should we design greener and sustainable house?"
results = colpali_model.search(query, k=2)
results
Output:
[{'doc_id': 1, 'page_num': 3, 'score': 12.0625, 'metadata': {}, 'base64': None},
 {'doc_id': 1, 'page_num': 9, 'score': 11.875, 'metadata': {}, 'base64': None}]
Let’s look at the documents retrieved from the above output.
import matplotlib.pyplot as plt

def get_result_images(results, all_images):
    grouped_images = []
    for result in results:
        doc_id = result['doc_id']
        page_num = result['page_num']
        # page_num is 1-indexed, while the image list is 0-indexed
        grouped_images.append(all_images[doc_id][page_num - 1])
    return grouped_images

result_images = get_result_images(results, all_images)

fig, axes = plt.subplots(1, 2, figsize=(15, 10))
for i, ax in enumerate(axes.flat):
    img = result_images[i]
    ax.imshow(img)
    ax.axis('off')
plt.tight_layout()
plt.show()
The retrieval model successfully retrieves the most relevant documents for our query.
Next, we will use Qwen-VL as our generative model. Qwen-VL is a vision language model that can understand images and produce text output. To load it, we will use the following code.
from transformers import Qwen2VLForConditionalGeneration, Qwen2VLProcessor
from qwen_vl_utils import process_vision_info
import torch

vl_model = Qwen2VLForConditionalGeneration.from_pretrained(
    "Qwen/Qwen2-VL-7B-Instruct",
    torch_dtype=torch.bfloat16,
)
vl_model.cuda().eval()
Next, we set up the Qwen-VL image processor and constrain the image resolution to keep GPU memory usage manageable.
min_pixels = 256*256
max_pixels = 1024*1024
vl_model_processor = Qwen2VLProcessor.from_pretrained(
    "Qwen/Qwen2-VL-7B-Instruct",
    min_pixels=min_pixels,
    max_pixels=max_pixels
)
Then, we will create the chat structure for our generative model, passing in the two retrieved page images along with the text query.
chat_template = [
    {
        "role": "user",
        "content": [
            {
                "type": "image",
                "image": result_images[0],
            },
            {
                "type": "image",
                "image": result_images[1],
            },
            {
                "type": "text",
                "text": query
            },
        ],
    }
]
text = vl_model_processor.apply_chat_template(
    chat_template, tokenize=False, add_generation_prompt=True
)
Lastly, we will process the images and the prompt text into tensor inputs for the model.
image_inputs, _ = process_vision_info(chat_template)
inputs = vl_model_processor(
    text=[text],
    images=image_inputs,
    padding=True,
    return_tensors="pt",
)
inputs = inputs.to("cuda")
When everything is ready, we will try out the Multimodal RAG system.
generated_ids = vl_model.generate(**inputs, max_new_tokens=100)
# Keep only the newly generated tokens, dropping the echoed prompt
generated_ids_trimmed = [
    out_ids[len(in_ids):] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_text = vl_model_processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print(output_text[0])
Output:
To design greener and sustainable houses, we should consider the following principles:
1. **Minimizing the use of scarce resources**: Use building materials, fossil fuels, and water efficiently.
2. **Economic operation**: Ensure the building is cost-effective throughout its life cycle and aligns with the needs of the local community.
3. **Energy and carbon efficiency**: Design the building to minimize energy consumption with effective insulation, heating, and cooling systems.
4. **Preserving and enhancing site character
The result is good and follows the PDF guides we provided earlier. To keep the output short, we cap generation at 100 new tokens, but you can always raise the limit. I also use only the top two retrieved page images; retrieving more can improve the accuracy of the output.
That's all you need to know about building a basic multimodal RAG pipeline. You can always try out other parameters and models to improve your results.
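As a recap, the whole pipeline can be wrapped into one helper. The sketch below only stitches together the objects we already created (colpali_model, all_images, vl_model, and vl_model_processor); the answer_query function and its parameters are my own naming for illustration:

def answer_query(query, k=2, max_new_tokens=100):
    # 1. Retrieve the top-k most relevant page images with ColPali
    results = colpali_model.search(query, k=k)
    images = [all_images[r['doc_id']][r['page_num'] - 1] for r in results]

    # 2. Build the chat message from the retrieved images and the question
    messages = [{"role": "user", "content": [
        *[{"type": "image", "image": img} for img in images],
        {"type": "text", "text": query},
    ]}]
    text = vl_model_processor.apply_chat_template(
        messages, tokenize=False, add_generation_prompt=True
    )
    image_inputs, _ = process_vision_info(messages)
    inputs = vl_model_processor(
        text=[text], images=image_inputs, padding=True, return_tensors="pt"
    ).to("cuda")

    # 3. Generate and decode only the newly produced tokens
    generated_ids = vl_model.generate(**inputs, max_new_tokens=max_new_tokens)
    trimmed = [out[len(inp):] for inp, out in zip(inputs.input_ids, generated_ids)]
    return vl_model_processor.batch_decode(
        trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
    )[0]

print(answer_query("How should we design greener and sustainable house?"))

Raising k or max_new_tokens here is the simplest way to experiment with longer, more detailed answers.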
Conclusion
Retrieval-augmented generation, or RAG, is a technique that combines retrieval-based methods with an LLM to improve its responses. Usually it works only with text data, but this article explored the possibility of using image data as input.
By combining ColPali with the Qwen-VL series, we built a RAG system that accepts both image and text data and can answer our query.
I hope this has helped!
Cornellius Yudha Wijaya is a data science assistant manager and data writer. While working full-time at Allianz Indonesia, he loves to share Python and data tips via social media and writing media. Cornellius writes on a variety of AI and machine learning topics.