Image by Author | Ideogram
Qwen2.5-Omni is a multi-modal, end-to-end AI model that accepts inputs in diverse formats like text, audio, image, and video, and generates text and speech responses in natural language. Hugging Face’s Transformers library provides access to many kinds of AI models beyond just language models, and Qwen2.5-Omni is among them.
Some of the end-to-end use cases this powerful model enables include:
- Real-Time Voice and Video Chat: Qwen2.5-Omni allows real-time interactions across text, audio, and video inputs, which greatly supports applications like virtual assistants and customer service agents.
- Robust Natural Speech Generation: This model can generate speech responses that sound natural, outperforming existing alternatives. This makes it an attractive option for applications requiring high-quality text-to-speech capabilities.
- Following Multi-Modal Instructions: Complex instructions that involve multiple modalities can be followed, for example, simultaneously understanding a video tutorial and providing step-by-step guidance, or analyzing an image and responding with key information about it.
As powerful as it is, bear in mind that the model requires considerable computing resources to run in most environments; hence, we will give a practical demonstration of how to load, configure, and use it in a comparatively simpler text generation scenario.
This article guides you through a demo project that sets up and runs an instance of this powerful multi-modal model in a Python script or notebook.
Demo Project
First, because Qwen2.5-Omni is a relatively new model as of the time of writing, we need to make sure the very latest version of the transformers library is installed in our development environment, uninstalling any obsolete version first:
pip uninstall transformers
pip install git+https://github.com/huggingface/transformers
pip install accelerate
pip install qwen-omni-utils -U
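To confirm that the environment picked up the development build of transformers and that the Qwen utilities are importable, a quick sanity check like the following can help (a source install of transformers typically reports a version string ending in .dev0):
import transformers
import qwen_omni_utils  # should import without errors after the installs above
print("Transformers version:", transformers.__version__)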
The next step is importing the essential classes for working with the Qwen2.5-Omni model, a large multi-modal model, and loading the model itself, specifically the conditional-generation architecture we will use here for text generation tasks.
Notice we installed the updated version of qwen-omni-utils, which helps guarantee compatibility with the latest version of the transformers library and offers improved support for the Qwen family of models, including utility functions and optimized performance. The .from_pretrained(...) call downloads and initializes the model weights. As the full model’s name suggests, the architecture is defined by 7 billion parameters.
from transformers import Qwen2_5OmniForConditionalGeneration, Qwen2_5OmniProcessor
from qwen_omni_utils import process_mm_info
model = Qwen2_5OmniForConditionalGeneration.from_pretrained("Qwen/Qwen2.5-Omni-7B", torch_dtype="auto", device_map="auto")
processor = Qwen2_5OmniProcessor.from_pretrained("Qwen/Qwen2.5-Omni-7B")
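The torch_dtype="auto" setting lets Transformers pick the data type stored in the checkpoint, and device_map="auto" places the weights on the available GPU(s). If you prefer to pin the precision explicitly, for example float16 on GPUs without bfloat16 support, a variant of the loading call might look like this (a sketch using the same model name and standard from_pretrained arguments):
import torch
# Explicitly request half-precision weights (about two bytes per parameter)
model = Qwen2_5OmniForConditionalGeneration.from_pretrained(
    "Qwen/Qwen2.5-Omni-7B",
    torch_dtype=torch.float16,
    device_map="auto",
)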
We will encapsulate the process of generating a response to a given prompt in a custom function, generate_response().
def generate_response(prompt, max_length=256):
    # Tokenize the prompt and move the tensors to the model's device
    inputs = processor(text=prompt, return_tensors="pt").to(model.device)
    # Generate up to max_length new tokens with nucleus sampling
    outputs = model.generate(
        **inputs,
        max_new_tokens=max_length,
        do_sample=True,
        temperature=0.7,
        top_p=0.9,
    )
    # Decode the generated tokens back into text
    response = processor.batch_decode(outputs, skip_special_tokens=True)[0]
    # If the prompt is echoed at the start of the output, trim it away
    if response.startswith(prompt):
        response = response[len(prompt):].strip()
    return response
This function:
- Uses the processor instance to tokenize the input prompt and move it to the model’s device
- Generates the model outputs, passing in the processed input and setting relevant hyperparameters like temperature and top-p (a deterministic alternative is sketched right after this list)
- Decodes the response and checks whether the initial prompt is echoed at its start, in which case it is trimmed for a more meaningful response
- Returns the final response
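The sampling settings above trade determinism for variety. If you want reproducible outputs while experimenting, a greedy-decoding version of the generation call inside the function (same inputs, just different generation settings) could be used instead:
# Greedy decoding: deterministic output, no temperature/top-p sampling
outputs = model.generate(
    **inputs,
    max_new_tokens=max_length,
    do_sample=False,
)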
To finalize, we write the main instructions to try the demo out.
prompt = "Explain multimodal AI models in simple terms."
print("\nGenerating response, please wait...")
response = generate_response(prompt)
print("\nPrompt:", prompt)
print("\nResponse:", response)
print("\n\n--- Interactive Demo ---")
print("Enter your prompt (type 'exit' to quit):")
while True:
    user_prompt = input("> ")
    if user_prompt.lower() == 'exit':
        break
    response = generate_response(user_prompt)
    print("\nResponse:", response)
    print("\nEnter your next prompt (type 'exit' to quit):")
We formulate a text prompt asking the model to explain a complex concept and call our generate_response() function. After that, we set up a loop that lets the user send follow-up prompts, thereby emulating a conversational agent application.
Importantly, notice that this code may take a long time to execute the first time, due to several factors: the model size (recall it is a 7-billion-parameter model), the warm-up of the first forward pass on a freshly loaded model, and the limited resources of the running environment. In fact, the model needs to be fully loaded onto the GPU before it can start running inference for generation. That said, after the first interaction, subsequent user-model exchanges should take considerably less time.
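If you are unsure whether your environment has a GPU with enough memory for the model, a quick check with standard PyTorch calls before loading it can save you a long wait:
import torch
if torch.cuda.is_available():
    props = torch.cuda.get_device_properties(0)
    print(f"GPU: {props.name}, total memory: {props.total_memory / 1024**3:.1f} GB")
else:
    print("No CUDA GPU detected: inference will run on CPU and be very slow.")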
Here is an example response obtained from the model (in this case, to a prompt about quantum computing):
Well, you know, quantum computing is kind of like regular computing but on a whole different level. In normal computers, data is processed using bits that can be either 0 or 1. But in quantum computers, they use qubits. These qubits can be both 0 and 1 at the same time, which is called superposition. Also, there’s something called entanglement where two qubits can be linked together so that the state of one affects the other no matter how far apart they are. This allows quantum computers to do some calculations much faster than regular computers for certain tasks. If you want to know more about it, like specific applications or how it compares to classical computing in more detail, just let me know.
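Finally, note that the process_mm_info helper we imported earlier is not needed for text-only prompts; it comes into play when the conversation includes images, audio, or video. Below is a minimal sketch of a text-plus-image request, loosely following the usage shown on the Qwen2.5-Omni model card; the exact keyword names (for instance, whether the processor expects audio or audios) can vary across versions, so treat them as assumptions to verify against your installed packages:
# Hypothetical multi-modal example: ask the model to describe an image
conversation = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": "https://example.com/some_image.jpg"},  # placeholder URL
            {"type": "text", "text": "Describe this image in one paragraph."},
        ],
    },
]
text = processor.apply_chat_template(conversation, add_generation_prompt=True, tokenize=False)
audios, images, videos = process_mm_info(conversation, use_audio_in_video=False)
inputs = processor(
    text=text, audio=audios, images=images, videos=videos,  # some versions use audios= instead of audio=
    return_tensors="pt", padding=True,
).to(model.device)
output_ids = model.generate(**inputs, max_new_tokens=256)
print(processor.batch_decode(output_ids, skip_special_tokens=True)[0])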
Wrapping Up
This article introduced the Qwen2.5-Omni model, outlined its capabilities for multimodal generation tasks, and walked through a simple demo of loading and setting up the model and using it for text generation.
Ah, and I almost forgot! In case you run out of time, resources (or patience) to download and run this vast model on your machine or cloud instance, you can always try out a demo with multiple input types here.
Iván Palomares Carrascosa is a leader, writer, speaker, and adviser in AI, machine learning, deep learning & LLMs. He trains and guides others in harnessing AI in the real world.