A Walkthrough of Nvidia’s Latest Multi-Modal LLM Family | by Mengliu Zhao | Oct, 2024


From LLaVA, Flamingo, to NVLM

Towards Data Science

Multi-modal LLM development has been advancing fast in recent years.

Although commercial multi-modal models like GPT-4V, GPT-4o, Gemini, and Claude 3.5 Sonnet are the most eye-catching performers these days, open-source models such as LLaVA, Llama 3-V, and Qwen-VL have been steadily catching up on public benchmarks.

Just last month, Nvidia released its open-source multi-modal LLM family, NVLM. The family comprises three architectures: a) decoder-based, b) cross-attention-based, and c) hybrid. The decoder-based model feeds both image and text tokens into a pre-trained LLM, as in LLaVA. The cross-attention-based model uses the image token embeddings as the keys and values while using the text token embeddings as the queries; because the attention is computed across two different sources, it is called "cross-attention," as in the original transformer decoder, rather than the self-attention used in decoder-only models. The hybrid architecture is a unique design that merges the decoder and cross-attention architectures to improve multi-modal reasoning, reduce the number of training parameters, and accept high-resolution input. The 72B decoder-based NVLM-D model achieves impressive performance, beating state-of-the-art open-source and commercial models on tasks like natural image understanding and OCR.
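To make the decoder vs. cross-attention distinction concrete, here is a minimal numpy sketch of the attention pattern in the cross-attention design: queries come from the text tokens, while keys and values come from the image tokens. This is illustrative only; NVLM-X's actual gated cross-attention uses learned projection matrices, multiple heads, and a learnable gate, none of which are shown here.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(text_emb, image_emb):
    """Single-head cross-attention with identity projections (toy version).

    Queries (Q) come from the text tokens; keys (K) and values (V)
    come from the image tokens -- two different sources, hence
    "cross"-attention rather than self-attention.
    """
    Q, K, V = text_emb, image_emb, image_emb
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)     # (n_text, n_image)
    weights = softmax(scores, axis=-1)  # each text token attends over image tokens
    return weights @ V                  # (n_text, d_model)

rng = np.random.default_rng(0)
text = rng.normal(size=(5, 64))    # 5 text tokens, embedding dim 64
image = rng.normal(size=(20, 64))  # 20 image patch tokens
out = cross_attention(text, image)
print(out.shape)  # (5, 64): one fused embedding per text token
```

In the decoder-based design (NVLM-D, like LLaVA), there is no such separate attention path: the image tokens are simply concatenated with the text tokens and the LLM's ordinary self-attention handles both.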

In this article, I’m going to walk through the following things:

  • the dynamic high-resolution (DHR) vision encoder, which all the NVLM models adopt
  • the decoder-based model, NVLM-D, compared to LLaVA
  • the gated cross-attention model, NVLM-X, compared to Flamingo
  • the hybrid model, NVLM-H
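Before diving in, it helps to have a mental model of what dynamic high-resolution (DHR) tiling does: the input image is split into a grid of fixed-size tiles that are each encoded separately, alongside a low-resolution global thumbnail. The sketch below shows only the tiling idea in numpy; the real DHR encoder additionally selects an aspect-ratio-matched grid, resizes the image before tiling, and uses proper interpolation for the thumbnail (the 448-pixel tile size and even-division assumption here are simplifications).

```python
import numpy as np

def dynamic_tiles(image, tile=448):
    """Split an image into fixed-size tiles plus a global thumbnail (toy DHR).

    Assumes the image height/width divide evenly by the tile size;
    a real pipeline would first resize to the best-matching grid.
    """
    h, w, _ = image.shape
    rows, cols = h // tile, w // tile
    tiles = [image[r * tile:(r + 1) * tile, c * tile:(c + 1) * tile]
             for r in range(rows) for c in range(cols)]
    # Crude strided downsample as a stand-in for a resized thumbnail.
    thumb = image[::rows, ::cols][:tile, :tile]
    return tiles, thumb

img = np.zeros((896, 1344, 3))   # fits a 2x3 grid of 448-pixel tiles
tiles, thumb = dynamic_tiles(img)
print(len(tiles), tiles[0].shape)  # 6 (448, 448, 3)
```

Each tile and the thumbnail are then passed through the vision encoder independently, which is how a fixed-resolution encoder can serve arbitrarily large inputs.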

Finally, I'll show the performance of the 72B NVLM-D model. Compared to state-of-the-art open-source and commercial models, it is stable on text-based tasks and shows superior performance on natural image understanding and OCR tasks.

Image source: https://pxhere.com/en/photo/821032
