On multi-modal LLM Flamingo’s vision encoder
Designing multi-modal LLMs is hard.
State-of-the-art multi-modal LLMs are mostly built on existing LLM architectures, with modifications that handle the different sources of input, and that is where the difficulty comes from. The latest Nvidia paper divides the commonly used multi-modal architectures into two categories:
- decoder-based;
- cross-attention-based.
One of my previous Medium articles discussed the latest paper from Meta, which uses a decoder-based architecture. It converts an input image into a latent vector with a VAE encoder, because the image space is continuous while the text space is discrete.
The cross-attention-based architecture faces a different problem. In Flamingo, for example, the critical issue is how to feed vision embeddings from a generic vision model, which vary in their temporal and spatial dimensions, into the cross-attention layers in a form that matches the language input dimension.
In this post, I will dive deep into the Perceiver Resampler, Flamingo’s unique design on top of the vision encoder, to explain how this issue was solved. I will also trace the Perceiver Resampler back to its origin, the Induced Set Attention Block from Set Transformer, which further inspired DeepMind’s Perceiver model for learning fixed-length latent embeddings from generic input data.
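To make the core idea concrete before going into details, here is a minimal NumPy sketch of how a fixed set of latent queries can cross-attend to a variable-length sequence of vision features and always return a fixed-length output. It is only an illustration, not Flamingo's implementation: the function name `resample`, the random weights, and the sizes (16-dimensional features, 4 latents, 9 patches per frame) are all assumptions, and the full Perceiver Resampler has further components, such as feed-forward layers and several stacked layers, that are left out here.

```python
import numpy as np


def softmax(x, axis=-1):
    # Subtract the max for numerical stability before exponentiating.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def resample(features, latents, w_q, w_k, w_v):
    """Single-head cross-attention: latents are queries, features are keys/values.

    features: (n, d) with n varying per input; latents: (num_latents, d).
    Returns an array of shape (num_latents, d), independent of n.
    """
    q = latents @ w_q                        # (num_latents, d)
    k = features @ w_k                       # (n, d)
    v = features @ w_v                       # (n, d)
    scores = q @ k.T / np.sqrt(q.shape[-1])  # (num_latents, n)
    weights = softmax(scores, axis=-1)       # each row sums to 1 over the n inputs
    return weights @ v                       # (num_latents, d)


# Hypothetical sizes and randomly initialised (untrained) parameters.
rng = np.random.default_rng(0)
d, num_latents, patches_per_frame = 16, 4, 9
latents = rng.normal(size=(num_latents, d))
w_q, w_k, w_v = (rng.normal(size=(d, d)) / np.sqrt(d) for _ in range(3))

# A single image (1 frame) and a short video (8 frames), flattened to (n, d).
image_features = rng.normal(size=(1 * patches_per_frame, d))
video_features = rng.normal(size=(8 * patches_per_frame, d))

out_image = resample(image_features, latents, w_q, w_k, w_v)
out_video = resample(video_features, latents, w_q, w_k, w_v)
print(out_image.shape, out_video.shape)
```

Both outputs have shape `(num_latents, d)` even though the inputs contain 9 and 72 feature vectors respectively, because the output length is set by the number of latent queries rather than by the input. That is the property the language model's cross-attention layers need.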