Integrating Text and Images for Smarter Data Classification

By Youness Mansar · Towards Data Science · November 2024

A technical walk-through on leveraging multi-modal AI to classify mixed text and image data, including detailed instructions, executable code examples, and tips for effective implementation.

Photo by Tschernjawski Sergej on Unsplash

In AI, one of the most exciting areas of growth is multimodal learning, where models process and combine different types of data — such as images and text — to better understand complex scenarios. This approach is particularly useful in real-world applications where information is often split between text and visuals.

Take e-commerce as an example: a product listing might include an image showing what an item looks like and a description providing details about its features. To fully classify and understand the product, both sources of information need to be considered together. Multimodal large language models (LLMs) like Gemini 1.5, Llama 3.2, and Phi-3 Vision, as well as open-source models such as LLaVA and DocOwl, have been developed specifically to handle these types of inputs.

Why Multimodal Models Are Important

Information from images and text can complement each other in ways that single-modality systems might miss:

  • A product’s description might mention its dimensions or material, which isn’t clear from the image alone.
  • On the other hand, an image might reveal key aspects like style or color that text can’t adequately describe.

If we only process images or text separately, we risk missing critical details. Multimodal models address this challenge by combining both sources during processing, resulting in more accurate and useful outcomes.

What You’ll Learn in This Tutorial

This tutorial will guide you through creating a pipeline designed to handle image-text classification. You’ll learn how to process and analyze inputs that combine visual and textual elements, achieving results that are more accurate than those from text-only systems.

If your project involves text-only classification, you might find my other blog post helpful — it focuses specifically on those methods.

To successfully build a multimodal image-text classification system, we’ll need three essential components. Here’s a breakdown of each element:

1. A Reliable LLM Provider

The backbone of this tutorial is a hosted LLM as a service. After experimenting with several options, I found that not all LLMs deliver consistent results, especially when working with structured outputs. Here’s a summary of my experience:

  • Groq and Fireworks.ai: These platforms offer multimodal LLMs in a serverless, pay-per-token format. While they seem promising, their APIs had issues following structured output requests. For example, when sending a query with a predefined schema, the returned output didn’t adhere to the expected format, making them unreliable for tasks requiring precision. Groq’s Llama 3.2 is still in preview, so I may try it again later. Fireworks.ai doesn’t typically respond to bug reports, so I’ll simply remove it from my options for now.
  • Gemini 1.5: After some trial and error, I settled on Gemini 1.5. It consistently returned results in the desired format and has been working reliably so far. It still has its own quirks that you will find if you poke at it long enough (like the fact that you can’t use enums that are too large…); we will discuss them later in the post. This is the LLM we will use for this tutorial.

2. The Python Library: LangChain

To interface with the LLM and handle multimodal inputs, we’ll use the LangChain library. LangChain is particularly well-suited for this task because it allows us to:

  • Inject both text and image data as input to the LLM.
  • Define a common abstraction across different LLM-as-a-service providers.
  • Define structured output schemas to ensure the results match the format we need.

Structured outputs are especially important for classification tasks, as they involve predefined classes that the output must conform to. LangChain ensures this structure is enforced, making it ideal for our use case.

3. The Classification Task: Keyword Suggestion for Photography Images

The task we’ll focus on in this tutorial is keyword suggestion for photography-related images. This is a multi-label classification problem, meaning that:

  • Each image can belong to more than one class simultaneously.
  • The list of possible classes is predefined.

For instance, an input consisting of an image and its description might be classified with keywords like landscape, sunset, and nature. While multiple keywords can apply to a single input, they must be selected from the predefined set of classes.

Now that we have the foundational concepts covered, let’s dive into the implementation. This step-by-step guide will walk you through configuring Gemini 1.5, setting up LangChain, and building a keyword suggestion system for photography-related images.

Step 1: Obtain Your Gemini API Key

The first step is to get your Gemini API key, which you can generate in Google AI Studio. Once you have your key, export it to an environment variable called GOOGLE_API_KEY. You can either:

  • Add it to a .env file:

GOOGLE_API_KEY=your_api_key_here

  • Export it directly in your terminal:

export GOOGLE_API_KEY=your_api_key_here
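If you go the .env route, a minimal sketch for loading the key at runtime could look like this (it assumes the optional python-dotenv package, which is not part of the tutorial’s requirements):

import os

from dotenv import load_dotenv

# Minimal sketch: load GOOGLE_API_KEY from a local .env file (assumes python-dotenv is installed).
load_dotenv()  # reads variables from .env into the process environment
assert os.environ.get("GOOGLE_API_KEY"), "GOOGLE_API_KEY is not set"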

Step 2: Install and Initialize the Client

Next, install the necessary libraries:

pip install langchain-google-genai~=2.0.4 langchain~=0.3.6

Once installed, initialize the client:

import os
from langchain_google_genai import ChatGoogleGenerativeAI

GOOGLE_MODEL_NAME = os.environ.get("GOOGLE_MODEL_NAME", "gemini-1.5-flash-002")

llm_google_client = ChatGoogleGenerativeAI(
    model=GOOGLE_MODEL_NAME,
    temperature=0,
    max_retries=10,
)
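As an optional sanity check (not part of the original setup), you can send a plain text prompt to confirm the client and API key are configured correctly:

# Optional sanity check: a simple text call to verify the client works end to end.
response = llm_google_client.invoke("Reply with the single word: OK")
print(response.content)  # should print a short confirmation from the model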

Step 3: Define the Output Schema

To ensure the LLM produces valid, structured results, we use Pydantic to define an output schema. This schema acts as a filter, validating that the categories returned by the model match our predefined list of acceptable values.

from typing import List, Literal
from pydantic import BaseModel, field_validator

def generate_multi_label_classification_model(list_classes: list[str]):
    assert list_classes  # Ensure classes are provided

    class ClassificationOutput(BaseModel):
        category: List[Literal[tuple(list_classes)]]

        @field_validator("category", mode="before")
        def filter_invalid_categories(cls, value):
            if isinstance(value, list):
                return [v for v in value if v in list_classes]
            return []  # Return an empty list if input is invalid

    return ClassificationOutput

Why field_validator Is Needed as a Workaround:

While defining the schema, we encountered a limitation in Gemini 1.5 (and similar LLMs): they do not strictly enforce enums. This means that even though we provide a fixed set of categories, the model might return values outside this set. For example:

  • Expected: ["landscape", "forest", "mountain"]
  • Returned: ["landscape", "ocean", "sun"] (with “ocean” and “sun” being invalid categories)

Without handling this, the invalid categories could cause errors or degrade the classification’s accuracy. To address this, the field_validator method is used as a workaround. It acts as a filter, ensuring:

  1. Only valid categories from list_classes are included in the output.
  2. Invalid or unexpected values are removed.

This safeguard ensures the model’s results align with the task’s requirements. It is annoying that we have to do this, but it seems to be a common issue across all the LLM providers I tested; if you know of one that handles enums well, please let me know.
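To make the workaround concrete, here is a small standalone check (with a made-up class list) showing how out-of-set labels are dropped before validation:

# Illustrative check: labels outside the allowed list are filtered out by the validator.
DemoOutput = generate_multi_label_classification_model(["landscape", "forest", "mountain"])

parsed = DemoOutput(category=["landscape", "ocean", "sun"])
print(parsed.category)  # ['landscape'] -- "ocean" and "sun" are removed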

Step 4: Bind the Schema to the LLM Client

Next, bind the schema to the client for structured output handling:

list_classes = [
    "shelter", "mesa", "dune", "cave", "metropolis",
    "reef", "finger", "moss", "pollen", "daisy",
    "fire", "daisies", "tree trunk",  # Add more classes as needed
]

categories_model = generate_multi_label_classification_model(list_classes)
llm_classifier = llm_google_client.with_structured_output(categories_model)
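Before building the full multimodal query, you can optionally try the structured classifier on a plain text prompt to confirm that only allowed labels come back (the prompt below is made up):

# Optional check: with_structured_output returns a validated instance of the Pydantic model.
result = llm_classifier.invoke("A photo of sand dunes under a clear sky.")
print(result.category)  # a list drawn only from list_classes, e.g. ['dune']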

Step 5: Build the Query and Call the LLM

Define the prediction function to send image and text inputs to the LLM:

...
def predict(self, text: str = None, image_url: str = None) -> list:
    assert text or image_url, "Provide either text or an image URL."

    content = []

    if text:
        content.append({"type": "text", "text": text})

    if image_url:
        image_data = base64.b64encode(httpx.get(image_url).content).decode("utf-8")
        content.append(
            {
                "type": "image_url",
                "image_url": {"url": f"data:image/jpeg;base64,{image_data}"},
            }
        )

    prediction = self.llm_classifier.invoke(
        [SystemMessage(content=self.system_prompt), HumanMessage(content=content)]
    )

    return prediction.category
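The ... above indicates that predict is a method on a small wrapper class. A minimal sketch of what that class could look like follows; the class name, constructor, and system prompt here are assumptions for illustration, not the author’s exact code:

# Hypothetical wrapper class around the classifier; names and prompt are illustrative.
import base64

import httpx
from langchain_core.messages import HumanMessage, SystemMessage


class ImageTextClassifier:
    def __init__(self, llm_classifier, system_prompt: str):
        self.llm_classifier = llm_classifier  # the structured-output client from Step 4
        self.system_prompt = system_prompt    # task instructions for keyword suggestion

    # ... the predict method shown above goes here ...


classifier = ImageTextClassifier(
    llm_classifier,
    system_prompt="Suggest keywords for the photo, using only the allowed list.",
)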

To send image data to the Gemini LLM API, we need to encode the image into a format the model can process. This is where base64 encoding comes into play.

What is Base64?

Base64 is a binary-to-text encoding scheme that converts binary data (like an image) into a text format. This is useful when transmitting data that might otherwise be incompatible with text-based systems, such as APIs. By encoding the image into base64, we can include it as part of the payload when sending data to the LLM.
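For reference, here is a minimal sketch of the encoding step on its own (the URL is a placeholder, not a real asset):

# Minimal sketch: download an image and base64-encode it for the API payload.
import base64

import httpx

image_url = "https://example.com/photo.jpg"  # placeholder URL
image_bytes = httpx.get(image_url).content
image_b64 = base64.b64encode(image_bytes).decode("utf-8")
data_uri = f"data:image/jpeg;base64,{image_b64}"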

Step 6: Get Results as Multi-Label Keywords

Finally, run the classifier and see the results. Let’s test it with an example:

Example Input 1:

Photo by Calvin Ma on Unsplash

classic red and white bus parked beside road

Result:

['transportation', 'vehicle', 'road', 'landscape', 'desert', 'rock', 'mountain']

Text Only:

['transportation', 'vehicle', 'road']

As shown, when using both text and image inputs, the results are more relevant to the actual content. With text-only input, the LLM gave correct but incomplete values.
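For reference, the two lists above correspond to calls along these lines, using the hypothetical classifier wrapper sketched earlier (the image URL is a placeholder, not the actual Unsplash asset):

# Hypothetical calls reproducing the comparison above; the URL is a placeholder.
caption = "classic red and white bus parked beside road"
bus_image_url = "https://example.com/red-white-bus.jpg"  # placeholder

keywords_multimodal = classifier.predict(text=caption, image_url=bus_image_url)
keywords_text_only = classifier.predict(text=caption)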

Example Input 2:

Photo by Tadeusz Lakota on Unsplash

black and white coated dog

Result:

['animal', 'mammal', 'dog', 'pet', 'canine', 'wildlife']

Text Only:

['animal', 'mammal', 'canine', 'dog', 'pet']

Multimodal classification, which combines text and image data, provides a way to create more contextually aware and effective AI systems. In this tutorial, we built a keyword suggestion system using Gemini 1.5 and LangChain, tackling key challenges like structured output handling and encoding image data.

By blending text and visual inputs, we demonstrated how this approach can lead to more accurate and meaningful classifications than using either modality alone. The practical examples highlighted the value of combining data types to better capture the full context of a given scenario.

This tutorial focused on text and image classification, but the principles can be applied to other multimodal setups. Here are some ideas to explore next:

  • Text and Video: Extend the system to classify or analyze videos by integrating video frame sampling along with text inputs, such as subtitles or metadata.
  • Text and PDFs: Develop classifiers that handle documents with rich content, like scientific papers, contracts, or resumes, combining visual layouts with textual data.
  • Real-World Applications: Integrate this pipeline into platforms like e-commerce sites, educational tools, or social media moderation systems.

These directions demonstrate the flexibility of multimodal approaches and their potential to address diverse real-world challenges. As multimodal AI evolves, experimenting with various input combinations will open new possibilities for more intelligent and responsive systems.

Full code: llmclassifier/llm_multi_modal_classifier.py
