Building RAG Systems with Transformers


Retrieval-Augmented Generation (RAG) has emerged as a powerful paradigm for enhancing the capabilities of large language models. By combining the strengths of retrieval systems with generative models, RAG systems can produce more accurate, factual, and contextually relevant responses. This approach is particularly valuable when dealing with domain-specific knowledge or when up-to-date information is required.

In this post, you will explore how to build a basic RAG system using models from the Hugging Face library. You’ll build each system component, from document indexing to retrieval and generation, and implement a complete end-to-end solution. Specifically, you will learn:

  • The RAG architecture and its components
  • How to build a document indexing and retrieval system
  • How to implement a transformer-based generator

Let’s get started!

Photo by Tina Nord. Some rights reserved.

Overview

This post is divided into five parts:

  • Understanding the RAG architecture
  • Building the Document Indexing System
  • Implementing the Retrieval System
  • Implementing the Generator
  • Building the Complete RAG System

Understanding the RAG Architecture

A RAG system consists of two main components:

  1. Retriever: Responsible for finding relevant documents or passages from a knowledge base given a query.
  2. Generator: Uses the retrieved documents and the original query to generate a coherent and informative response.

Each of these components has many fine details. You need RAG because the generator alone (i.e., the language model) may produce responses that are inaccurate or fabricated, a failure mode known as hallucination. Therefore, you need the retriever to provide relevant context that helps the generator stay grounded in facts.

This approach combines generative models’ broad language understanding capabilities with the ability to access specific information from a knowledge base. This results in responses that are both fluent and factually accurate.

Let’s implement each component of a RAG system step by step.

Building the Document Indexing System

The first step in creating a RAG system is to build a document indexing system. This system must encode documents into dense vector representations and store them in a database so that they can later be retrieved by contextual similarity. This means you need to be able to search by vector similarity metrics rather than exact matches. This is a key point: not all database systems can be used to build a document indexing system.

Of course, you could collect documents, encode them into vector representations, and keep them in memory. When retrieval is requested, you could compute the similarity one by one to find the closest match. However, checking each vector in a loop is inefficient and not scalable. FAISS is a library that is optimized for this task. To install FAISS, you can compile it from source or use the pre-compiled version from PyPI:
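For example, the CPU-only build published on PyPI can be installed with pip:

```
pip install faiss-cpu
```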

In the following, you’ll create a language model to encode documents into dense vector representations and store them in a FAISS index for efficient retrieval:
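Below is a minimal sketch of this step. The model name and the mean-pooling approach follow the description later in this post; the document list itself is only a hypothetical example:

```python
import faiss
import torch
from transformers import AutoModel, AutoTokenizer

# Pre-trained sentence embedding model, as named later in this post
MODEL_NAME = "sentence-transformers/all-MiniLM-L6-v2"
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModel.from_pretrained(MODEL_NAME)

def generate_embedding(docs):
    """Encode a list of documents into dense vectors using mean pooling."""
    inputs = tokenizer(docs, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs)
    # Mean pooling over token embeddings, ignoring the padding tokens
    mask = inputs["attention_mask"].unsqueeze(-1).float()
    summed = (outputs.last_hidden_state * mask).sum(dim=1)
    counts = mask.sum(dim=1)
    return (summed / counts).numpy()

# A small, hypothetical document collection
documents = [
    "Transformers are deep learning models built on self-attention.",
    "FAISS is a library for efficient similarity search over dense vectors.",
    "Retrieval-Augmented Generation combines a retriever with a generator.",
    "The Eiffel Tower is located in Paris, France.",
]

# Build an L2-distance FAISS index over the document embeddings
embeddings = generate_embedding(documents)
index = faiss.IndexFlatL2(embeddings.shape[1])
index.add(embeddings)
```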

The key part of this code is the generate_embedding() function. It takes a list of documents, encodes them through the model, and returns a dense vector representation for each one using mean pooling over its token embeddings. The documents do not need to be long or complete; a sentence or a paragraph is expected because the model has a limited context window. Moreover, as you will see later, a very long document is not ideal for RAG.

You used a pre-trained Sentence Transformer model, sentence-transformers/all-MiniLM-L6-v2, which is specifically designed for generating sentence embeddings. You do not keep the original documents in the FAISS index; you only keep the embedding vectors, and you build an L2-distance index over them for efficient similarity search.

You may modify this code for different implementations of the RAG system. For example, the dense vector representation here is obtained by mean pooling, but you could instead use just the first token's embedding, since the tokenizer prepends the [CLS] token to each sentence and the model is supposed to produce a contextual embedding at this special token. Moreover, L2 distance is used here because you declared the FAISS index with the L2 metric. FAISS does not offer a cosine similarity metric directly, but L2 distance and cosine similarity are closely related. Note that, with normalized vectors,

$$
\begin{align}
\Vert \mathbf{x} - \mathbf{y} \Vert_2^2
&= (\mathbf{x} - \mathbf{y})^\top (\mathbf{x} - \mathbf{y}) \\
&= \mathbf{x}^\top \mathbf{x} - 2 \mathbf{x}^\top \mathbf{y} + \mathbf{y}^\top \mathbf{y} \\
&= 2 - 2 \mathbf{x}^\top \mathbf{y} \\
&= 2 - 2 \cos \theta
\end{align}
$$

Therefore, L2 distance is equivalent to cosine distance when the vectors are normalized: the two metrics rank neighbors identically, as long as you remember that as dissimilarity increases, the squared L2 distance grows from 0 to 4 while the cosine similarity decreases from +1 to -1. If you intend to use cosine similarity, you should modify the code to become:
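One way to do this, reusing the generate_embedding() helper above, is to normalize the embeddings in place with FAISS and switch to an inner-product index so that the scores become cosine similarities:

```python
# Normalize embeddings to unit length so that inner products equal cosine similarity
embeddings = generate_embedding(documents)
faiss.normalize_L2(embeddings)                    # in-place L2 normalization
index = faiss.IndexFlatIP(embeddings.shape[1])    # inner-product (cosine) index
index.add(embeddings)
```

Remember to normalize the query embedding the same way at search time. The rest of this post keeps the plain L2 index for simplicity.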

Essentially, you scaled each embedding vector to make it unit length.

Implementing the Retrieval System

With the documents indexed, let’s see how you can retrieve some of the most relevant documents for a given query:
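A sketch of this retrieval step is shown below. It reuses the generate_embedding() helper and the index built earlier; the query string is only an illustrative example:

```python
def retrieve_documents(query, index, documents, k=3):
    """Return the k documents closest to the query in embedding space."""
    query_embedding = generate_embedding([query])
    distances, indices = index.search(query_embedding, k)
    # Map FAISS indices back to the original documents
    retrieved = [documents[i] for i in indices[0]]
    return retrieved, distances[0]

query = "What is FAISS used for?"   # example query
retrieved_docs, distances = retrieve_documents(query, index, documents)
for doc, dist in zip(retrieved_docs, distances):
    print(f"Distance {dist:.4f}: {doc}")
```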

If you run this code, you will see the retrieved documents printed together with their L2 distances to the query.

In the function retrieve_documents(), you provide the query string, the FAISS index, and the document collection. You then generate the embedding for the query just like you did for the documents. Then, you leverage the search() method of the FAISS index to find the k most similar documents to the query embedding. The search() method returns two arrays:

  • distances: The distances between the query embedding and the indexed embeddings. Since this is how you defined the index, these are the L2 distances.
  • indices: The indices of the indexed embeddings that are most similar to the query embedding, matching the distances array.

You can use these arrays to retrieve the most similar documents from the original collection. Here, you use the indices to get the documents from the list. Afterward, you print the retrieved documents along with their distances from the query in the embedding space in descending order of relevance or increasing distance.

Note that the document’s context vector is supposed to represent the entire document. Therefore, the distance between the query and the document may be large if the document contains a lot of information. Ideally, you want the documents to be focused and concise. If you have a long text, you may want to split it into multiple documents to make the RAG system more accurate.

This retrieval system forms the first component of our RAG architecture. Given a user query, it allows us to find relevant information from our knowledge base. There are many other ways to implement the same functionality, but this highlights the key idea of vector search.

Implementing the Generator

Next, let’s implement the generator component of our RAG system.

This is essentially a prompt engineering problem. When the user provides a query, you first retrieve the most relevant documents with the retriever and create a new prompt that includes both the user’s query and the retrieved documents as context. Then, you use a pre-trained language model to generate a response based on this new prompt.

Here is how you can implement it:
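One possible implementation is sketched below; the "question: ... context: ..." prompt format and the generation parameters are assumptions you can adapt to your model:

```python
from transformers import T5ForConditionalGeneration, T5Tokenizer

# Pre-trained sequence-to-sequence model (the small T5 variant)
gen_tokenizer = T5Tokenizer.from_pretrained("t5-small")
gen_model = T5ForConditionalGeneration.from_pretrained("t5-small")

def generate_response(query, retrieved_docs, max_length=150):
    """Combine the query with retrieved documents and generate an answer."""
    context = " ".join(retrieved_docs)
    prompt = f"question: {query} context: {context}"   # assumed prompt format
    inputs = gen_tokenizer(prompt, return_tensors="pt",
                           truncation=True, max_length=512)
    outputs = gen_model.generate(
        inputs["input_ids"],
        max_length=max_length,
        num_beams=4,           # simple beam search, as described below
        early_stopping=True,
    )
    return gen_tokenizer.decode(outputs[0], skip_special_tokens=True)
```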

This is the generator component of our RAG system. You instantiate the pre-trained T5 model (the small version, but you can pick a larger one, or a different model that fits on your system). T5 is a sequence-to-sequence model that generates a new sequence from a given sequence. If you use a different kind of model, such as a causal LM, you may need to change the prompt format to get good results.

In the generate_response() function, you combine the query and the retrieved documents into a single prompt. Then, you use the T5 model to generate a response. You can adjust the generation parameters to make it work better. In the above, only beam search is used for simplicity. The model’s output is then decoded to a text string as the response. Since you combined multiple documents into a single prompt, you need to be careful that the prompt does not exceed the context window of the model.

The generator leverages the information from the retrieved documents to produce a fluent and factually accurate response. The model behaves vastly differently when you just pose the query without context.

Building the Complete RAG System

That’s all you need to build a basic RAG system. Let’s create a function to wrap up the retrieval and generation components:
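A minimal wrapper, assuming the retrieve_documents() and generate_response() functions defined above, could look like this:

```python
def rag_pipeline(query, index, documents, k=3):
    """End-to-end RAG: retrieve relevant documents, then generate a grounded answer."""
    retrieved_docs, _ = retrieve_documents(query, index, documents, k=k)
    return generate_response(query, retrieved_docs)
```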

Then you can use the RAG pipeline in a loop to generate responses for a set of queries:
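For example, with a couple of illustrative queries:

```python
queries = [
    "What is FAISS used for?",
    "Where is the Eiffel Tower located?",
]
for query in queries:
    answer = rag_pipeline(query, index, documents)
    print(f"Query:    {query}")
    print(f"Response: {answer}\n")
```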

You can see that the queries are answered one by one in a loop. The set of documents, however, is prepared in advance and reused for all queries. This is how a RAG system typically works.

The complete program is simply all of the snippets above combined into a single script.

This code is self-contained. All the documents and queries are defined in the code. This is a starting point, and you may extend it for new features, such as saving the indexed documents in a file that you can load later without re-indexing every time.


Summary

This post explored building a Retrieval-Augmented Generation (RAG) system using transformer models from the Hugging Face library. We’ve implemented each system component, from document indexing to retrieval and generation, and combined them into a complete end-to-end solution.

RAG systems represent a powerful approach to enhancing the capabilities of language models by grounding them in external knowledge. RAG systems can produce more accurate, factual, and contextually relevant responses by retrieving relevant information and incorporating it into the generation process.
