Understanding RAG Part V: Managing Context Length
Image by Editor | Midjourney & Canva
One major limitation of conventional large language models (LLMs) is their context length limit, which restricts the amount of information that can be processed in a single user-model interaction. Addressing this limitation has been one of the main lines of work in the LLM development community, highlighting the advantages of longer contexts for producing more coherent and accurate responses. For example, GPT-3, released in 2020, had a context length of 2,048 tokens, while its younger but more powerful sibling GPT-4 Turbo, born in 2023, accepts a whopping 128K tokens in a single prompt. That is roughly equivalent to being able to process an entire book in a single interaction, for instance to summarize it.
Retrieval augmented generation (RAG), on the other hand, incorporates external knowledge retrieved from documents, usually stored in vector databases, to enhance the context and relevance of LLM outputs. Managing context length in RAG systems remains a challenge, however: in scenarios requiring substantial contextual information, the retrieved content must be efficiently selected and summarized to stay below the LLM's input limit without losing essential knowledge.
Strategies for Long Context Management in RAG
There are several strategies RAG systems can use to incorporate as much relevant retrieved knowledge as possible into the user query before passing it to the LLM, while staying within the model's input limits. Four of them are outlined below, from simplest to most sophisticated.
1. Document Chunking
Document chunking is generally the simplest strategy: it splits documents in the vector database into smaller chunks. Although it may not sound obvious at first glance, this strategy helps overcome the context length limitation of LLMs in RAG systems in several ways, for instance by reducing the risk of retrieving redundant information while keeping each chunk contextually coherent.
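To make this concrete, the snippet below is a minimal sketch of fixed-size chunking with a small overlap between consecutive chunks. The chunk size and overlap values are arbitrary choices for illustration; real systems often split on sentence or paragraph boundaries instead.

```python
def chunk_document(text: str, chunk_size: int = 500, overlap: int = 50) -> list[str]:
    """Split a document into fixed-size character chunks with a small overlap,
    so that context spanning a chunk boundary is not lost entirely."""
    chunks = []
    start = 0
    while start < len(text):
        end = min(start + chunk_size, len(text))
        chunks.append(text[start:end])
        if end == len(text):
            break
        start = end - overlap  # step back slightly to preserve boundary context
    return chunks

# Each chunk would then be embedded and stored in the vector database separately.
document = "Lorem ipsum dolor sit amet. " * 200  # placeholder for a long document
chunks = chunk_document(document, chunk_size=500, overlap=50)
print(f"{len(chunks)} chunks created")
```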
2. Selective Retrieval
Selective retrieval applies a filtering process to a large set of relevant documents so that only the most highly relevant parts are retrieved, narrowing down the size of the input sequence passed to the LLM. By intelligently filtering which parts of the retrieved documents to retain, it aims to avoid incorporating irrelevant or extraneous information.
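The sketch below illustrates one possible filtering step, assuming a retriever that returns (chunk, similarity score) pairs. The score threshold, token budget, and whitespace-based token estimate are illustrative assumptions rather than fixed recommendations.

```python
def select_chunks(scored_chunks: list[tuple[str, float]],
                  min_score: float = 0.75,
                  max_tokens: int = 3000) -> list[str]:
    """Keep only the most relevant retrieved chunks and stop once a rough
    token budget for the LLM prompt is reached."""
    # Rank by relevance score, highest first
    ranked = sorted(scored_chunks, key=lambda pair: pair[1], reverse=True)
    selected, used_tokens = [], 0
    for chunk, score in ranked:
        if score < min_score:
            break  # everything after this point is even less relevant
        est_tokens = len(chunk.split())  # crude token estimate for illustration
        if used_tokens + est_tokens > max_tokens:
            break
        selected.append(chunk)
        used_tokens += est_tokens
    return selected

# Hypothetical retriever output: (chunk_text, cosine_similarity)
retrieved = [("RAG combines retrieval and generation...", 0.91),
             ("Unrelated trivia about the weather...", 0.42)]
context = "\n\n".join(select_chunks(retrieved))
```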
3. Targeted Retrieval
While similar to selective retrieval, the essence of targeted retrieval is retrieving data with a very concrete intent or final response in mind. This is achieved by optimizing the retriever mechanisms for specific types of queries or data sources, e.g. building specialized retrievers for medical texts, news articles, recent scientific breakthroughs, and so on. In short, it constitutes an evolved, more specialized form of selective retrieval with additional domain-specific criteria in the loop.
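A simple way to picture this is a query router that dispatches each query to a domain-specific retriever. The retriever names and routing table below are hypothetical placeholders for whatever specialized indexes a real system would maintain.

```python
from typing import Callable

# Hypothetical domain-specific retrievers; in practice each would query
# its own specialized index (e.g. medical texts vs. news articles).
def medical_retriever(query: str) -> list[str]:
    return [f"[medical index] result for: {query}"]

def news_retriever(query: str) -> list[str]:
    return [f"[news index] result for: {query}"]

def general_retriever(query: str) -> list[str]:
    return [f"[general index] result for: {query}"]

ROUTES: dict[str, Callable[[str], list[str]]] = {
    "medical": medical_retriever,
    "news": news_retriever,
}

def targeted_retrieve(query: str, domain: str | None = None) -> list[str]:
    """Dispatch the query to a retriever specialized for its domain,
    falling back to a general-purpose retriever when no route matches."""
    retriever = ROUTES.get(domain, general_retriever)
    return retriever(query)

print(targeted_retrieve("latest treatment guidelines for hypertension", domain="medical"))
```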
4. Context Summarization
Context summarization is a more sophisticated approach to managing context length in RAG systems, in which text summarization techniques are applied while building the final context. One way to do this is with an additional language model, often smaller and trained for summarization tasks, that summarizes large chunks of retrieved documents. This summarization can be extractive or abstractive: the former identifies and extracts relevant text passages, while the latter generates a new summary from scratch that rephrases and condenses the original chunks. Alternatively, some RAG solutions use heuristic methods to assess the relevance of pieces of text, e.g. chunks, and discard the less relevant ones.
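As one possible implementation of the abstractive variant, the sketch below condenses each retrieved chunk with a smaller summarization model before it is added to the final context. It assumes the Hugging Face transformers library is available, and the model name is only an example.

```python
from transformers import pipeline

# Smaller model dedicated to summarization; the model choice is illustrative.
summarizer = pipeline("summarization", model="facebook/bart-large-cnn")

def summarize_chunks(chunks: list[str], max_length: int = 120) -> str:
    """Condense each retrieved chunk before it is added to the LLM context."""
    summaries = []
    for chunk in chunks:
        result = summarizer(chunk, max_length=max_length, min_length=30, do_sample=False)
        summaries.append(result[0]["summary_text"])
    return "\n".join(summaries)
```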
| Strategy | Summary |
|---|---|
| Document Chunking | Splits documents into smaller, coherent chunks to preserve context while reducing redundancy and staying within LLM limits. |
| Selective Retrieval | Filters large sets of relevant documents to retrieve only the most pertinent parts, minimizing extraneous information. |
| Targeted Retrieval | Optimizes retrieval for specific query intents using specialized retrievers, adding domain-specific criteria to refine results. |
| Context Summarization | Uses extractive or abstractive summarization techniques to condense large amounts of retrieved content, ensuring essential information is passed to the LLM. |
Long-Context Language Models
And how about long-context LLMs? Wouldn’t that be enough, without the need for RAG?
That’s an important question to address. Long-context LLMs (LC-LLMs) are “extra-large” LLMs capable of accepting very long sequences of input tokens. Despite research evidence that LC-LLMs often outperform RAG systems, the latter still have particular advantages, most notably in scenarios requiring dynamic, real-time information retrieval and cost efficiency. In such applications, it is worth considering a smaller LLM wrapped in a RAG system that uses the strategies described above, rather than an LC-LLM. Neither is a one-size-fits-all solution, and each shines in the settings it is best suited for.
Wrapping Up
This article introduced and described four strategies for managing context length in RAG systems, useful when the LLMs in such systems are limited in the length of input they can accept in a single user interaction. While so-called long-context LLMs have recently become a popular way to overcome this issue, there are situations where sticking with RAG systems is still worthwhile, especially dynamic information retrieval scenarios requiring real-time, up-to-date contexts.