How to Create a RAG Evaluation Dataset From Documents | by Dr. Leon Eversberg | Nov, 2024


Automatically create domain-specific datasets in any language using LLMs

Towards Data Science
The HuggingFace dataset card showing an example RAG evaluation dataset that we generated.
Our automatically generated RAG evaluation dataset on the Hugging Face Hub (PDF input file from the European Union licensed under CC BY 4.0). Image by the author

In this article I will show you how to create your own RAG dataset consisting of contexts, questions, and answers from documents in any language.

Retrieval-Augmented Generation (RAG) [1] is a technique that allows LLMs to access an external knowledge base.

By uploading PDF files and storing them in a vector database, we can retrieve this knowledge via a vector similarity search and then insert the retrieved text into the LLM prompt as additional context.

This provides the LLM with new knowledge and reduces the possibility of the LLM making up facts (hallucinations).

An overview of the RAG pipeline. For documents storage: input documents -> text chunks -> encoder model -> vector database. For LLM prompting: User question -> encoder model -> vector database -> top-k relevant chunks -> generator LLM model. The LLM then answers the question with the retrieved context.
The basic RAG pipeline. Image by the author from the article “How to Build a Local Open-Source LLM Chatbot With RAG”

However, there are many parameters we need to set in a RAG pipeline, and researchers are always suggesting new improvements. How do we know which parameters to choose and which methods will really improve performance for our particular use case?

This is why we need a validation/dev/test dataset to evaluate our RAG pipeline. The dataset should be from the domain we are interested…

Recent Articles

Related Stories

Leave A Reply

Please enter your comment!
Please enter your name here