Speed up your LLM inference
The transformer architecture is arguably one of the most impactful innovations in modern deep learning. Proposed in the famous 2017 paper “Attention Is All You Need,” it has become the go-to approach for most language-related modeling, including all Large Language Models (LLMs), such as the GPT family, as well as many computer vision tasks.
As the complexity and size of these models grow, so does the need to optimize their inference speed, especially in chat applications where users expect immediate replies. Key-value (KV) caching is a clever trick to do just that. Let's see how it works and when to use it.
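As a quick preview, if you generate text with the Hugging Face transformers library (an assumption for this example; the idea applies to any transformer implementation), KV caching is already built in and can be toggled with a single flag. Here is a minimal sketch that times the same greedy generation with and without the cache, using the small gpt2 checkpoint purely for illustration:

```python
import time

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# A small model keeps the example quick to run; any decoder model works.
model_name = "gpt2"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

inputs = tokenizer("The transformer architecture is", return_tensors="pt")

for use_cache in (True, False):
    start = time.perf_counter()
    with torch.no_grad():
        model.generate(
            **inputs,
            max_new_tokens=100,
            do_sample=False,                      # greedy decoding
            use_cache=use_cache,                  # toggles the key-value cache
            pad_token_id=tokenizer.eos_token_id,  # silence the padding warning
        )
    elapsed = time.perf_counter() - start
    print(f"use_cache={use_cache}: {elapsed:.2f} s for 100 new tokens")
```

You should see the cached run finish noticeably faster, and the gap grows as more tokens are generated.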
Before we dive into KV caching, we need to take a short detour into the attention mechanism used in transformers. Understanding how it works is necessary to spot and appreciate how KV caching optimizes transformer inference.
We will focus on autoregressive models used to generate text. These so-called decoder models include the GPT family, Gemini, Claude, and GitHub Copilot. They are trained on a simple task: predicting the next token in a sequence. During inference, the model is provided with some text, and its task is…
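To make the autoregressive loop concrete, here is a minimal sketch of greedy next-token generation, again assuming the transformers and torch packages and the gpt2 checkpoint (the prompt is only an illustration). The model is run on the current sequence, the most likely next token is picked, appended to the input, and the process repeats:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

# The prompt the model should continue.
input_ids = tokenizer("The transformer architecture is", return_tensors="pt").input_ids

with torch.no_grad():
    for _ in range(20):  # generate 20 new tokens, one at a time
        logits = model(input_ids).logits  # shape: (1, seq_len, vocab_size)
        next_token = logits[:, -1, :].argmax(dim=-1, keepdim=True)  # greedy pick
        input_ids = torch.cat([input_ids, next_token], dim=-1)      # append and repeat

print(tokenizer.decode(input_ids[0]))
```

Note that this naive loop reruns the model over the entire sequence at every step, repeating most of the computation from the previous steps. That redundant work is exactly what KV caching avoids.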