Building a RAG Pipeline with llama.cpp in Python



Using llama.cpp enables efficient and accessible inference of large language models (LLMs) on local devices, particularly when running on CPUs. This article takes that capability all the way to retrieval augmented generation (RAG), providing a practical, example-based guide to building a RAG pipeline with this framework using Python.

Step-by-Step Process

First, we install the necessary packages:
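A package list along these lines should work; the exact names are an assumption based on the components used throughout this article (the llama.cpp Python bindings, LangChain and its community integrations, pypdf, an embedding library, and Chroma):

```
pip install llama-cpp-python langchain langchain-community huggingface_hub pypdf sentence-transformers chromadb
```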

Bear in mind that the initial setup may take a few minutes to complete if none of these components were previously installed in your running environment.

After installing llama.cpp, LangChain, and other components such as pypdf for handling PDF documents in the document corpus, it’s time to import everything we need.
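A sketch of the imports used in the rest of the walkthrough (the module paths assume a recent LangChain release that ships the community integrations as a separate package):

```python
import os

from huggingface_hub import hf_hub_download
from langchain.chains import RetrievalQA
from langchain.prompts import PromptTemplate
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_community.document_loaders import PyPDFLoader, TextLoader
from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain_community.llms import LlamaCpp
from langchain_community.vectorstores import Chroma
```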

Time to get started with the real process. The first thing we need to do is download an LLM locally. Even though in a real scenario you may want a bigger LLM, to keep our example lightweight we will load a relatively small one (I know, that just sounded contradictory!), namely the quantized Llama 2 7B model, available from Hugging Face:
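One way to do this is with the Hugging Face Hub client; the repository and file name below are assumptions, and any GGUF quantization of Llama 2 7B should work:

```python
# Download a quantized GGUF build of Llama 2 7B Chat to the local cache and
# keep the resulting file path so llama.cpp can load it later.
model_path = hf_hub_download(
    repo_id="TheBloke/Llama-2-7B-Chat-GGUF",
    filename="llama-2-7b-chat.Q4_K_M.gguf",
)
```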

Intuitively, we now need to set up another major component in any RAG system: the document base. In this example, we will create a mechanism to read documents in multiple formats, including .pdf and .txt, and for simplicity we will provide a default sample text document built on the fly, adding it to our newly created documents directory, docs. To try it yourself with an extra level of fun, make sure you load actual documents of your own.
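A minimal sketch of this step, assuming we only handle .txt and .pdf files (other formats would need their own LangChain loaders):

```python
# Create the docs directory and write a small sample document on the fly.
os.makedirs("docs", exist_ok=True)
with open("docs/sample.txt", "w") as f:
    f.write(
        "llama.cpp is a C/C++ implementation for running inference of large "
        "language models efficiently on local hardware, including plain CPUs."
    )

# Load every document in docs/, choosing a loader based on the file extension.
documents = []
for file_name in os.listdir("docs"):
    path = os.path.join("docs", file_name)
    if file_name.lower().endswith(".pdf"):
        documents.extend(PyPDFLoader(path).load())
    elif file_name.lower().endswith(".txt"):
        documents.extend(TextLoader(path).load())
```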

Notice that after processing the documents, we split them into chunks, which is a common practice in RAG systems for enhancing retrieval accuracy and ensuring the LLM effectively processes manageable inputs within its context window.
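For example, using LangChain's recursive splitter (the chunk size and overlap are illustrative values you can tune):

```python
# Split the loaded documents into overlapping chunks for retrieval.
text_splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=50)
chunks = text_splitter.split_documents(documents)
```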

Both LLMs and RAG systems need to handle numerical representations of text rather than raw text; therefore, we next build a vector store containing embeddings of our text documents. Chroma is a lightweight, open-source vector database for efficiently storing and querying embeddings.
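Something along these lines builds the store; the embedding model chosen here is an assumption, and any sentence-transformers model would do:

```python
# Embed the chunks and persist them in a local Chroma collection.
embeddings = HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2")
vectorstore = Chroma.from_documents(chunks, embeddings, persist_directory="chroma_db")
```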

Now llama.cpp enters the scene for initializing our previously downloaded LLM. To do this, a LlamaCpp object is instantiated with the model path and other settings like model temperature, maximum context length, and so on.
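A possible initialization, with settings that are sensible starting points rather than tuned values:

```python
# Load the quantized model through llama.cpp via LangChain's LlamaCpp wrapper.
llm = LlamaCpp(
    model_path=model_path,  # path returned by hf_hub_download above
    temperature=0.1,        # low temperature for more factual answers
    max_tokens=512,         # cap on the number of generated tokens
    n_ctx=2048,             # maximum context length
    verbose=False,
)
```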

We are getting closer to the inference show, and just a few actors remain to appear on stage. One is the RAG prompt template, which is an elegant way to define how the retrieved context and user query are combined into a single, well-structured input for the LLM during inference.
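A simple template of this kind could look as follows (the wording is just one reasonable choice):

```python
# Prompt that places the retrieved context ahead of the user's question.
template = """Use the following context to answer the question at the end.
If you don't know the answer, just say that you don't know.

Context: {context}

Question: {question}

Answer:"""
prompt = PromptTemplate(template=template, input_variables=["context", "question"])
```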

Finally, we put everything together to create our RAG pipeline based on llama.cpp.
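Here is one way to wire the pieces together with LangChain's RetrievalQA chain (the variable name rag_pipeline is just an illustrative choice):

```python
# Combine the llama.cpp LLM, the Chroma retriever, and the custom prompt
# into a retrieval-augmented question-answering chain.
rag_pipeline = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="stuff",
    retriever=vectorstore.as_retriever(search_kwargs={"k": 3}),
    return_source_documents=True,
    chain_type_kwargs={"prompt": prompt},
)
```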

Let’s review the building blocks of the RAG pipeline we just created for a better understanding:

  • llm: the LLM downloaded and then initialized using llama.cpp.
  • chain_type: specifies how the retrieved documents in a RAG system are put together and sent to the LLM, with "stuff" meaning that all retrieved context is injected into the prompt.
  • retriever: initialized upon the vector store and configured to fetch the three most relevant document chunks.
  • return_source_documents=True: used to obtain information about which document chunks were used to answer the user’s question.
  • chain_type_kwargs={"prompt": prompt}: enables the use of our recently defined custom template to format the retrieval-augmented input into a presentable format for the LLM.

To finalize and see everything in action, we define and utilize a pipeline-driving function, ask_question(), that runs the RAG pipeline to answer the user’s questions.
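A sketch of such a function, assuming the rag_pipeline chain defined above:

```python
def ask_question(question: str) -> None:
    """Run the RAG pipeline on a question and print the answer and its sources."""
    result = rag_pipeline.invoke({"query": question})
    print(f"Question: {question}")
    print(f"Answer: {result['result']}")
    for doc in result["source_documents"]:
        print(f"Source: {doc.metadata.get('source', 'unknown')}")
```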

Now let’s try out our pipeline with some specific questions.
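For instance (the question here simply matches the sample document we created earlier):

```python
ask_question("What is llama.cpp and why is it useful for local inference?")
```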

Result:

Wrapping Up

This article demonstrated how to set up and use a local RAG pipeline efficiently with llama.cpp, a popular framework for running inference on existing LLMs locally in a lightweight and portable fashion. You should now be able to apply these newly learned skills in your own projects.
