Let’s explore major design choices from Meta’s Apollo paper
As we’ve been anticipating, models are becoming increasingly capable of understanding different types of inputs. We’ve seen image transformer models (see my blogs on fine-tuning Flux and the research behind MM1) and now we’re beginning to see video models hit the scene.
In December 2024, Meta unveiled their new Apollo family of models, along with a paper detailing their research on Large Multimodal Models (LMMs). The paper is full of great details, so rather than try to cover it all, I'll focus on the four major design choices they highlighted when building their models.
Let’s dive in!
Embedding
Let’s first lay out some quick ideas that are important for understanding what’s going on here. Every Transformer relies on embeddings for its input. User input is typically converted from a human-readable form (text, video) into tokens, and then from tokens into embeddings. To convert tokens to embeddings, we use an embedding model. For multi-modal inputs, we typically use a different encoder for each input type.
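
To make this concrete, here’s a minimal PyTorch sketch of the idea, not Apollo’s actual architecture: the dimensions, the toy vision encoder, and the patch sizes below are all illustrative assumptions. The point is simply that each modality gets its own encoder, and everything lands in the same embedding space that the language model consumes.

```python
import torch
import torch.nn as nn

D_MODEL = 512          # hypothetical embedding width
VOCAB_SIZE = 32000     # hypothetical text vocabulary size

# Text path: token IDs -> learned embedding vectors
text_embedder = nn.Embedding(VOCAB_SIZE, D_MODEL)

# Vision path: video frames -> patch features -> projected into the same space.
# A single linear layer stands in here for a real vision encoder (e.g. a ViT).
class ToyVisionEncoder(nn.Module):
    def __init__(self, patch_dim: int = 3 * 16 * 16, d_model: int = D_MODEL):
        super().__init__()
        self.proj = nn.Linear(patch_dim, d_model)

    def forward(self, patches: torch.Tensor) -> torch.Tensor:
        # patches: (batch, num_patches, patch_dim)
        return self.proj(patches)

vision_encoder = ToyVisionEncoder()

# Text: a batch with 8 token IDs
token_ids = torch.randint(0, VOCAB_SIZE, (1, 8))
text_embeds = text_embedder(token_ids)            # (1, 8, 512)

# Video: 4 frames, each split into 196 flattened 16x16 RGB patches
frame_patches = torch.randn(1, 4 * 196, 3 * 16 * 16)
visual_embeds = vision_encoder(frame_patches)     # (1, 784, 512)

# The LMM sees one interleaved sequence of embeddings, regardless of modality.
multimodal_sequence = torch.cat([visual_embeds, text_embeds], dim=1)
print(multimodal_sequence.shape)                  # torch.Size([1, 792, 512])
```

Once the visual and text embeddings live in the same space, the Transformer itself doesn’t need to care which modality each position came from.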