Qwen Releases the Qwen2.5-VL-32B-Instruct: A 32B Parameter VLM that Surpasses Qwen2.5-VL-72B and Other Models like GPT-4o Mini


In the evolving field of artificial intelligence, vision-language models (VLMs) have become essential tools, enabling machines to interpret and generate insights from both visual and textual data. Despite advancements, challenges remain in balancing model performance with computational efficiency, especially when deploying large-scale models in resource-limited settings.

Qwen has introduced Qwen2.5-VL-32B-Instruct, a 32-billion-parameter VLM that surpasses its larger counterpart, Qwen2.5-VL-72B, as well as models such as GPT-4o Mini, and is released under the Apache 2.0 license. The release reflects a commitment to open-source collaboration and addresses the need for high-performing yet computationally manageable models.

Technically, the Qwen2.5-VL-32B-Instruct model offers several enhancements:

  • Visual Understanding: The model excels at recognizing objects and analyzing text, charts, icons, graphics, and layouts within images.
  • Agent Capabilities: It functions as a dynamic visual agent capable of reasoning and directing tools for computer and phone interactions.
  • Video Comprehension: The model can understand videos over an hour long and pinpoint relevant segments, demonstrating advanced temporal localization.
  • Object Localization: It accurately identifies objects in images by generating bounding boxes or points, providing stable JSON outputs for coordinates and attributes.
  • Structured Output Generation: The model supports structured outputs for data such as invoices, forms, and tables, benefiting applications in finance and commerce (see the usage sketch after this list).

These features enhance the model’s applicability across various domains requiring nuanced multimodal understanding.
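For readers who want to try these capabilities locally, the sketch below shows one way to prompt the model for grounded, JSON-style output through the Hugging Face transformers integration and the qwen-vl-utils helper package, following the usage pattern published on the Qwen2.5-VL model cards. It is a minimal sketch rather than an official recipe: the image path, prompt, and generation settings are illustrative placeholders, and it assumes a transformers release with Qwen2.5-VL support installed alongside qwen-vl-utils.

    # Minimal sketch (not an official recipe): query Qwen2.5-VL-32B-Instruct for
    # grounded JSON output via Hugging Face transformers + qwen-vl-utils.
    from transformers import AutoProcessor, Qwen2_5_VLForConditionalGeneration
    from qwen_vl_utils import process_vision_info

    MODEL_ID = "Qwen/Qwen2.5-VL-32B-Instruct"
    model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
        MODEL_ID, torch_dtype="auto", device_map="auto"
    )
    processor = AutoProcessor.from_pretrained(MODEL_ID)

    # The image path and prompt below are illustrative placeholders.
    messages = [{
        "role": "user",
        "content": [
            {"type": "image", "image": "file:///path/to/invoice.png"},
            {"type": "text", "text": "Detect every line item and return a JSON list "
                                     "of bounding boxes with an amount attribute."},
        ],
    }]

    # Build the chat prompt and pack image/video inputs for the processor.
    text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
    image_inputs, video_inputs = process_vision_info(messages)
    inputs = processor(
        text=[text], images=image_inputs, videos=video_inputs,
        padding=True, return_tensors="pt"
    ).to(model.device)

    generated = model.generate(**inputs, max_new_tokens=512)
    trimmed = generated[:, inputs.input_ids.shape[1]:]  # drop the prompt tokens
    print(processor.batch_decode(trimmed, skip_special_tokens=True)[0])

In Qwen’s published grounding examples, responses of this kind come back as a JSON list of bounding-box coordinates with per-object labels and attributes, which downstream code can parse directly; the exact keys depend on the prompt.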

Empirical evaluations highlight the model’s strengths:

  • Vision Tasks: On the MMMU (Massive Multi-discipline Multimodal Understanding) benchmark, the model scored 70.0, surpassing Qwen2-VL-72B’s 64.5. On MathVista, it achieved 74.7 compared with the previous 70.5. Notably, on OCRBenchV2 (English/Chinese), the model scored 57.2/59.1, a significant improvement over the prior 47.8/46.1. On Android Control tasks, it achieved 69.6/93.3, exceeding the previous 66.4/84.4.
  • Text Tasks: The model also delivered competitive text-only performance, scoring 78.4 on MMLU, 82.2 on MATH, and an impressive 91.5 on HumanEval, outperforming models such as GPT-4o Mini in certain areas.

These results underscore the model’s balanced proficiency across diverse tasks.

In conclusion, the Qwen2.5-VL-32B-Instruct represents a significant advancement in vision-language modeling, achieving a harmonious blend of performance and efficiency. Its open-source availability under the Apache 2.0 license encourages the global AI community to explore, adapt, and build upon this robust model, potentially accelerating innovation and application across various sectors.


Check out the Model Weights. All credit for this research goes to the researchers of this project.


