Hand-computing a fundamental component of multimodal models
Cross attention is a fundamental tool for building AI models that can understand multiple forms of data simultaneously: think of language models that can interpret images, like those used in ChatGPT, or models that generate video from text, like Sora.
This summary walks through every critical mathematical operation inside cross attention, so you can understand its inner workings at a fundamental level.
Cross attention is used when modeling a variety of data types, each of which may require a different input representation. For natural-language data, one would typically use a word-to-vector embedding combined with a positional encoding to compute a vector representing each word.
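As a rough sketch of that text pipeline, here is a toy NumPy example: a randomly initialized embedding table stands in for learned word embeddings, and a sinusoidal positional encoding is added so word order is reflected in each vector. The vocabulary, dimensions, and weights are all made up for illustration.

```python
import numpy as np

# Toy stand-in for a trained embedding layer: each word maps to a row vector.
vocab = {"the": 0, "cat": 1, "sat": 2}
d_model = 8
rng = np.random.default_rng(0)
embedding_table = rng.normal(size=(len(vocab), d_model))  # would be learned in practice

def positional_encoding(seq_len, d_model):
    # Sinusoidal positional encoding: sin on even dimensions, cos on odd ones,
    # with wavelengths that grow geometrically across the dimensions.
    pos = np.arange(seq_len)[:, None]
    i = np.arange(d_model)[None, :]
    angles = pos / np.power(10000.0, (2 * (i // 2)) / d_model)
    return np.where(i % 2 == 0, np.sin(angles), np.cos(angles))

tokens = ["the", "cat", "sat"]
word_vecs = embedding_table[[vocab[t] for t in tokens]]
x = word_vecs + positional_encoding(len(tokens), d_model)
print(x.shape)  # (3, 8): one position-aware vector per word
```

The result `x` is the kind of per-token vector sequence that would feed into the attention computation.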
For visual data, one might pass the image through an encoder specifically designed to summarize the image into a vector representation.
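Once both modalities are in vector form, cross attention lets one modality query the other. The minimal sketch below (assumed shapes and random weights, not a real model) computes softmax(QKᵀ/√d)V, with queries taken from the text vectors and keys/values from the image encoder's output:

```python
import numpy as np

rng = np.random.default_rng(1)
d = 8
text = rng.normal(size=(3, d))   # 3 text-token vectors (e.g. embeddings + positions)
image = rng.normal(size=(5, d))  # 5 image-patch vectors from a visual encoder

# Separate learned projections for queries, keys, and values (random here).
W_q, W_k, W_v = (rng.normal(size=(d, d)) for _ in range(3))

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

Q = text @ W_q   # (3, d): one query per text token
K = image @ W_k  # (5, d)
V = image @ W_v  # (5, d)

scores = Q @ K.T / np.sqrt(d)       # (3, 5): each text token scores every image patch
weights = softmax(scores, axis=-1)  # each row is a probability distribution
out = weights @ V                   # (3, d): image-informed text representations
print(out.shape)  # (3, 8)
```

The key point is the asymmetry: queries come from one data type while keys and values come from the other, which is what distinguishes cross attention from self attention.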