Let’s explore major design choices from Meta’s Apollo paper
As we’ve been anticipating, models are becoming increasingly capable of understanding different types of inputs. We’ve seen image transformer models (see my blogs on fine-tuning Flux and the research behind MM1) and now we’re beginning to see video models hit the scene.
In December 2024, Meta unveiled their new Apollo family of models, along with a paper detailing their research on Large Multimodal Models (LMMs). The paper is full of great details, so rather than try to cover it all, I'll focus on the four major design choices they highlighted when building their models.
Let’s dive in!
Embedding
Let’s first lay out some quick ideas that are important for understanding what’s going on here. Every Transformer relies on embeddings for its input. User input is typically converted from a human-readable form (text, video) into tokens, and then from tokens into embeddings. To convert tokens to embeddings, we use an embedding model. For multi-modal inputs, we typically use a different encoder for each input type.
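
To make this concrete, here’s a minimal PyTorch sketch of the idea, not Apollo’s actual architecture: the dimensions, the toy vision encoder, and the patch sizes below are all illustrative assumptions. The point is simply that each modality gets its own encoder, and everything lands in the same embedding space that the language model consumes.

```python
import torch
import torch.nn as nn

D_MODEL = 512          # hypothetical embedding width
VOCAB_SIZE = 32000     # hypothetical text vocabulary size

# Text path: token IDs -> learned embedding vectors
text_embedder = nn.Embedding(VOCAB_SIZE, D_MODEL)

# Vision path: video frames -> patch features -> projected into the same space.
# A single linear layer stands in here for a real vision encoder (e.g. a ViT).
class ToyVisionEncoder(nn.Module):
    def __init__(self, patch_dim: int = 3 * 16 * 16, d_model: int = D_MODEL):
        super().__init__()
        self.proj = nn.Linear(patch_dim, d_model)

    def forward(self, patches: torch.Tensor) -> torch.Tensor:
        # patches: (batch, num_patches, patch_dim)
        return self.proj(patches)

vision_encoder = ToyVisionEncoder()

# Text: a batch with 8 token IDs
token_ids = torch.randint(0, VOCAB_SIZE, (1, 8))
text_embeds = text_embedder(token_ids)            # (1, 8, 512)

# Video: 4 frames, each split into 196 flattened 16x16 RGB patches
frame_patches = torch.randn(1, 4 * 196, 3 * 16 * 16)
visual_embeds = vision_encoder(frame_patches)     # (1, 784, 512)

# The LMM sees one interleaved sequence of embeddings, regardless of modality.
multimodal_sequence = torch.cat([visual_embeds, text_embeds], dim=1)
print(multimodal_sequence.shape)                  # torch.Size([1, 792, 512])
```

Once the visual and text embeddings live in the same space, the Transformer itself doesn’t need to care which modality each position came from.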