Automatically create domain-specific datasets in any language using LLMs
In this article I will show you how to create your own RAG dataset consisting of contexts, questions, and answers from documents in any language.
Retrieval-Augmented Generation (RAG) [1] is a technique that allows LLMs to access an external knowledge base.
By splitting documents such as PDF files into chunks, embedding them, and storing them in a vector database, we can retrieve the relevant passages via a vector similarity search and then insert the retrieved text into the LLM prompt as additional context.
This provides the LLM with new knowledge and reduces the possibility of the LLM making up facts (hallucinations).
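To make the retrieve-then-insert idea concrete, here is a minimal sketch in plain Python. It is not a production pipeline: the `embed` function is a toy bag-of-words stand-in for a real embedding model, a Python list stands in for the vector database, and the names `embed`, `cosine`, `retrieve`, and `build_prompt`, as well as the sample chunks, are made up for illustration.

```python
import math
import re
from collections import Counter


def embed(text):
    """Toy embedding: lowercase word counts (assumption: stands in for a real embedding model)."""
    return Counter(re.findall(r"\w+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query, chunks, k=1):
    """Return the k chunks most similar to the query (a list stands in for the vector database)."""
    q = embed(query)
    ranked = sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)
    return ranked[:k]


def build_prompt(query, contexts):
    """Insert the retrieved text into the prompt as additional context."""
    context_block = "\n".join(contexts)
    return f"Context:\n{context_block}\n\nQuestion: {query}\nAnswer:"


# Hypothetical document chunks, e.g. extracted from a PDF.
chunks = [
    "The warranty period for the device is two years.",
    "The device ships with a USB-C charging cable.",
    "Support is available by email on weekdays.",
]

query = "How long is the warranty period?"
top = retrieve(query, chunks, k=1)
prompt = build_prompt(query, top)
print(prompt)
```

In a real setup, the toy `embed` would be replaced by an embedding model and the list by a vector database, but the flow stays the same: embed the query, rank the stored chunks by similarity, and prepend the best matches to the prompt.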
However, a RAG pipeline has many parameters to set, and researchers keep proposing new improvements. How do we know which parameters to choose and which methods will actually improve performance for our particular use case?
This is why we need a validation/dev/test dataset to evaluate our RAG pipeline. The dataset should be from the domain we are interested…