Evaluate RAGs Rigorously or Perish | by Jarek Grygolec, Ph.D. | Apr, 2024

The results presented in Table 1 seem very appealing, at least to me. The simple evolution performs very well. In the case of the reasoning evolution, the first part of the question is answered perfectly, but the second part is left unanswered. Inspecting the Wikipedia page [3], it is evident that the document contains no answer to the second part of the question, so this can also be interpreted as restraint from hallucination, a good thing in itself. The multi-context question-answer pair seems very good. The conditional evolution type is acceptable if we look at the question-answer pair. One way of looking at these results is that there is always room for better engineering of the prompts behind the evolutions. Another way is to use better LLMs, especially for the critic role, as is the default in the ragas library.


The ragas library not only generates synthetic evaluation sets, but also provides built-in metrics for component-wise as well as end-to-end evaluation of RAGs.

Picture 2: RAG Evaluation Metrics in RAGAS. Image created by the author in draw.io.

As of this writing, RAGAS provides eight out-of-the-box metrics for RAG evaluation (see Picture 2), and new ones will likely be added in the future. In general, you should choose the metrics most suitable for your use case. However, I recommend starting with the single most important metric, i.e.:

Answer Correctness — the end-to-end metric with scores between 0 and 1, the higher the better, measuring the accuracy of the generated answer as compared to the ground truth.

Focusing on the one end-to-end metric helps to start the optimisation of your RAG system as fast as possible. Once you achieve some improvements in quality you can look at component-wise metrics, focusing on the most important one for each RAG component:

Faithfulness — the generation metric with scores between 0 and 1, the higher the better, measuring the factual consistency of the generated answer relative to the provided context. It is about grounding the generated answer as much as possible in the provided context, and by doing so prevent hallucinations.

Context Relevance — the retrieval metric with scores between 0 and 1, the higher the better, measuring the relevancy of retrieved context relative to the question.
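To make the Faithfulness score above more concrete: as I read the ragas documentation, it is the share of claims in the generated answer that can be inferred from the retrieved context. The sketch below only shows that final ratio. In ragas an LLM extracts the claims and judges each one, whereas here the verdicts are supplied by hand, and the function name is my own, not part of the library.

```python
def faithfulness_score(claim_supported: list[bool]) -> float:
    """Fraction of the answer's claims that the retrieved context supports.

    Illustrative only: the per-claim verdicts are given by hand here,
    whereas a real evaluator would obtain them from an LLM judge.
    """
    if not claim_supported:
        raise ValueError("the answer must contain at least one claim")
    return sum(claim_supported) / len(claim_supported)


# Hypothetical answer with three claims, two of which are grounded in the context
verdicts = [True, True, False]
print(faithfulness_score(verdicts))
```

An answer whose claims are all supported scores 1, and every unsupported claim pulls the score towards 0.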

RAG Factory

OK, so we have a RAG ready for optimisation… not so fast, this is not enough. To optimise a RAG we need a factory function that generates RAG chains for a given set of RAG hyperparameters. Here we define this factory function in 2 steps:

Step 1: A function to store documents in the vector database.

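# Define a function that returns the document collection from the vector DB for given hyperparameters.
# The function embeds the documents only if the collection is empty.
# This is a development version; for production one would rather implement a document-level check.
# NOTE: several lines of this listing were lost; everything marked "assumed" is a reconstruction.
def get_vectordb_collection(chroma_client,
                            documents,        # assumed parameter: the body reads `documents`
                            embedding_model,  # assumed parameter: the body reads `embedding_model`
                            chunk_size=None, overlap_size=0) -> ChromaCollection:

    if chunk_size is None:
        # No chunking: store every document as a whole
        collection_name = "full_text"
        docs_pp = documents
    else:
        collection_name = f"{embedding_model}_chunk{chunk_size}_overlap{overlap_size}"

        text_splitter = CharacterTextSplitter(
            chunk_size=chunk_size,       # assumed argument
            chunk_overlap=overlap_size,  # assumed argument
        )

        docs_pp = text_splitter.transform_documents(documents)

    embedding = OpenAIEmbeddings(model=embedding_model)

    langchain_chroma = Chroma(client=chroma_client,
                              collection_name=collection_name,  # assumed argument
                              embedding_function=embedding)     # assumed argument

    existing_collections = [collection.name for collection in chroma_client.list_collections()]

    # Embed and store the documents only when the collection is still empty
    if chroma_client.get_collection(collection_name).count() == 0:
        langchain_chroma.add_documents(docs_pp)  # assumed: the original call was lost
    return langchain_chroma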

Step 2: A function to generate a RAG chain in LangChain on top of the document collection, i.e. the RAG factory function proper.

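# Define a function that returns a simple RAG as a LangChain chain for given hyperparameters.
# The RAG also returns the retrieved context documents, which RAGAs needs for evaluation.
# NOTE: several lines of this listing were lost; everything marked "assumed" is a reconstruction.

def get_chain(chroma_client,
              documents,                                 # assumed parameter
              embedding_model="text-embedding-ada-002",  # assumed default
              llm_model="gpt-3.5-turbo",                 # assumed default
              chunk_size=None,
              overlap_size=0,
              top_k=4,                                   # assumed default
              lambda_mult=0.25) -> RunnableSequence:

    vectordb_collection = get_vectordb_collection(chroma_client=chroma_client,
                                                  documents=documents,
                                                  embedding_model=embedding_model,
                                                  chunk_size=chunk_size,
                                                  overlap_size=overlap_size)

    # k and lambda_mult go through search_kwargs; lambda_mult only takes effect with MMR search
    retriever = vectordb_collection.as_retriever(
        search_type="mmr",
        search_kwargs={"k": top_k, "lambda_mult": lambda_mult})

    # The {context} placeholder is assumed: the chain below fills a `context` variable
    template = """Answer the question based only on the following context.
If the context doesn't contain entities present in the question say you don't know.

{context}

Question: {question}
"""
    prompt = ChatPromptTemplate.from_template(template)
    llm = ChatOpenAI(model=llm_model)

    def format_docs(docs):
        return "\n\n".join([doc.page_content for doc in docs])

    chain_from_docs = (
        RunnablePassthrough.assign(context=(lambda x: format_docs(x["context"])))
        | prompt
        | llm
        | StrOutputParser()
    )

    # question and ground_truth are passed through; the answer is added on top (assumed wiring)
    chain_with_context_and_ground_truth = RunnableParallel(
        context=itemgetter("question") | retriever,
        question=itemgetter("question"),
        ground_truth=itemgetter("ground_truth"),
    ).assign(answer=chain_from_docs)

    return chain_with_context_and_ground_truth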

The former function, get_vectordb_collection, is called inside the latter, get_chain, which generates our RAG chain for a given set of parameters, i.e. embedding_model, llm_model, chunk_size, overlap_size, top_k and lambda_mult. With our factory function we are just scratching the surface of the RAG hyperparameters that could be optimised. Note also that the RAG chain requires 2 arguments: question and ground_truth, where the latter is just passed through the RAG chain, as it is required for evaluation using RAGAs.

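# Setting up a ChromaDB client
chroma_client = chromadb.EphemeralClient()

# Testing the full-text RAG (chunk_size=None stores each document unsplit)

with warnings.catch_warnings():
    warnings.simplefilter("ignore")  # assumed: the body of this block was lost
    rag_prototype = get_chain(chroma_client=chroma_client,
                              documents=documents,  # assumed argument
                              chunk_size=None)      # assumed argument

rag_prototype.invoke({"question": 'What happened in Minneapolis to the bridge?',
                      "ground_truth": "x"})["answer"]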

RAG Evaluation

To evaluate our RAG we will use a diverse dataset of news articles from CNN and Daily Mail, which is available on Hugging Face [4]. Most articles in this dataset are below 1000 words. In addition, we will use a tiny extract of just 100 news articles from the dataset. This is all done to limit the cost and time needed to run the demo.

# Getting the tiny extract of the CNN Daily Mail dataset
synthetic_evaluation_set_url = "https://gist.github.com/gox6/0858a1ae2d6e3642aa132674650f9c76/raw/synthetic-evaluation-set-cnn-daily-mail.csv"
synthetic_evaluation_set_pl = pl.read_csv(synthetic_evaluation_set_url, separator=",").drop("index")
# Train/test split
# We need at least 2 sets: train and test for RAG optimization.

shuffled = synthetic_evaluation_set_pl.sample(fraction=1,
                                              shuffle=True,  # assumed: the rest of this call was lost
                                              seed=6)        # assumed seed value
test_fraction = 0.5

test_n = round(len(synthetic_evaluation_set_pl) * test_fraction)
# head(-test_n) keeps all rows except the last test_n, so the test set must be
# the last test_n rows (tail); taking head(test_n) would overlap with train
train, test = (shuffled.head(-test_n),
               shuffled.tail(test_n))
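Whatever tool does the split, the two parts must not share rows, otherwise the test score is measured on questions the optimiser has already seen. Below is a minimal plain-Python sketch of a disjoint shuffle-and-cut split together with a check for overlap. The function name, the seed and the toy data are illustrative assumptions, not part of the article's code.

```python
import random


def train_test_split(rows, test_fraction=0.5, seed=0):
    """Shuffle a copy of `rows` and cut it into two non-overlapping parts."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    test_n = round(len(shuffled) * test_fraction)
    split_at = len(shuffled) - test_n
    # Everything before the cut is train, everything after it is test
    return shuffled[:split_at], shuffled[split_at:]


# Toy row ids standing in for the evaluation questions
train_rows, test_rows = train_test_split(range(10), test_fraction=0.5, seed=0)

# Sanity check: no row may appear in both parts
assert set(train_rows).isdisjoint(test_rows)
```

The same disjointness check is cheap to run on the real train and test frames, for example on their question columns.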

As we will consider many different RAG prototypes beyond the one defined above, we need a function to collect the answers generated by the RAG on our synthetic evaluation set:

# Helper function that generates the RAG answers together with the ground truth for the synthetic evaluation set.
# The dataset for RAGAs evaluation should contain the columns: question, answer, ground_truth, contexts.
# RAGAs expects the data in Hugging Face Dataset format.

def generate_rag_answers_for_synthetic_questions(chain,
                                                 synthetic_evaluation_set) -> pl.DataFrame:

    rows = []

    for row in synthetic_evaluation_set.iter_rows(named=True):
        rag_output = chain.invoke({"question": row["question"],
                                   "ground_truth": row["ground_truth"]})
        # RAGAs wants the retrieved contexts as a list of strings
        rag_output["contexts"] = [doc.page_content for doc
                                  in rag_output["context"]]
        del rag_output["context"]
        rows.append(pl.DataFrame({k: [v] for k, v in rag_output.items()}))

    # Concatenate once at the end instead of growing the frame inside the loop
    return pl.concat(rows, how="vertical") if rows else pl.DataFrame()

RAG Optimisation with RAGAs and Optuna

First, it is worth emphasising that the proper optimisation of a RAG system should involve global optimisation, where all parameters are optimised at once, in contrast to the sequential or greedy approach, where parameters are optimised one by one. The sequential approach ignores the fact that there can be interactions between the parameters, which can result in a sub-optimal solution.
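A toy example illustrates how such an interaction can mislead the sequential approach. The score table below is entirely made up for illustration and is not a measurement, and an exhaustive grid search stands in for the global optimiser to keep the sketch short.

```python
from itertools import product

CHUNK_SIZES = [500, 1000]
TOP_KS = [2, 8]

# Hypothetical scores: the best chunk size depends on top_k (an interaction)
TOY_SCORES = {
    (500, 2): 0.50, (500, 8): 0.60,
    (1000, 2): 0.55, (1000, 8): 0.40,
}


def toy_score(chunk_size: int, top_k: int) -> float:
    return TOY_SCORES[(chunk_size, top_k)]


def sequential_search(start_top_k: int) -> tuple[int, int]:
    # Step 1: tune chunk_size with top_k frozen at its starting value
    best_chunk = max(CHUNK_SIZES, key=lambda c: toy_score(c, start_top_k))
    # Step 2: tune top_k with chunk_size frozen at the step-1 winner
    best_k = max(TOP_KS, key=lambda k: toy_score(best_chunk, k))
    return best_chunk, best_k


def global_search() -> tuple[int, int]:
    # Evaluate every combination of the two parameters together
    return max(product(CHUNK_SIZES, TOP_KS), key=lambda p: toy_score(*p))


greedy = sequential_search(start_top_k=2)
best = global_search()
print("sequential:", greedy, toy_score(*greedy))
print("global:    ", best, toy_score(*best))
```

Starting from top_k=2, the sequential search commits to the larger chunk size and never visits the combination of small chunks with a large top_k, which is the best one in this table.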

Now at last we are ready to optimise our RAG system. We will use the hyperparameter optimisation framework Optuna. To this end we define the objective function for the Optuna study, which specifies the allowed hyperparameter space and computes the evaluation metric, see the code below:

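def objective(trial):

    embedding_model = trial.suggest_categorical(name="embedding_model",
                                                choices=["text-embedding-ada-002", "text-embedding-3-small"])

    # NOTE: the search bounds were lost from this listing. The ranges below are assumed,
    # chosen only so that they contain the values reported in this article.
    chunk_size = trial.suggest_int(name="chunk_size", low=500, high=1000, step=100)

    overlap_size = trial.suggest_int(name="overlap_size", low=100, high=400, step=50)

    top_k = trial.suggest_int(name="top_k", low=1, high=10, step=1)

    challenger_chain = get_chain(chroma_client,
                                 documents=documents,              # assumed argument
                                 embedding_model=embedding_model,  # assumed argument
                                 chunk_size=chunk_size,            # assumed argument
                                 overlap_size=overlap_size,
                                 top_k=top_k)                      # assumed argument

    # Generate answers on the train split and convert them to the format RAGAs expects
    challenger_answers_pl = generate_rag_answers_for_synthetic_questions(challenger_chain, train)
    challenger_answers_hf = Dataset.from_pandas(challenger_answers_pl.to_pandas())

    challenger_result = evaluate(challenger_answers_hf,
                                 metrics=[answer_correctness])  # assumed: implied by the key returned below

    return challenger_result['answer_correctness']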

Finally, with the objective function in place, we define and run the study to optimise our RAG system in Optuna. It's worth noting that we can add our educated guesses of hyperparameters to the study with the method enqueue_trial, as well as limit the study by time or number of trials; see the Optuna docs for more tips.

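sampler = optuna.samplers.TPESampler(seed=6)
study = optuna.create_study(study_name="RAG Optimisation",
                            sampler=sampler,
                            direction="maximize")  # assumed: answer_correctness is higher-is-better

educated_guess = {"embedding_model": "text-embedding-3-small",
                  "chunk_size": 1000,
                  "overlap_size": 200,
                  "top_k": 3}

# Queue the educated guess as a trial (this call was lost from the listing)
study.enqueue_trial(educated_guess)

print(f"Sampler is {study.sampler.__class__.__name__}")
study.optimize(objective, timeout=180)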

In our study the educated guess was not confirmed as the best configuration, but I'm sure that with a rigorous approach like the one proposed above the results will get better.

Best trial with answer_correctness: 0.700130617593832
Hyper-parameters for the best trial: {'embedding_model': 'text-embedding-ada-002', 'chunk_size': 700, 'overlap_size': 400, 'top_k': 9}

Limitations of RAGAs

After experimenting with the ragas library to synthesise evaluation sets and to evaluate RAGs, I have some caveats:

  • The question may contain the answer.
  • The ground-truth is just the literal excerpt from the document.
  • Issues with RateLimitError as well as network overflows on Colab.
  • Built-in evolutions are few and there is no easy way to add new ones.
  • There is room for improvements in documentation.

The first 2 caveats are quality-related. Their root cause may lie in the LLM used, and obviously GPT-4 gives better results than GPT-3.5-Turbo. At the same time, it seems that this could be improved by some prompt engineering of the evolutions used to generate synthetic evaluation sets.

As for the issues with rate limiting and network overflows, it is advisable to use: 1) checkpointing during the generation of synthetic evaluation sets to prevent the loss of already created data, and 2) exponential backoff to make sure you complete the whole task.
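The two pieces of advice above can be combined in a few lines of plain Python. This is a minimal sketch under stated assumptions, not the way ragas handles retries: the helper names, the JSON checkpoint file and the choice to retry on any exception are all mine, and in practice you would narrow retry_on to the rate-limit error of your client.

```python
import json
import random
import time
from pathlib import Path


def call_with_backoff(fn, *args, max_retries=5, base_delay=1.0,
                      retry_on=(Exception,), sleep=time.sleep):
    """Call fn(*args), retrying with exponentially growing, jittered delays."""
    for attempt in range(max_retries):
        try:
            return fn(*args)
        except retry_on:
            if attempt == max_retries - 1:
                raise  # out of retries: surface the error
            # Delay doubles on every attempt, plus jitter to avoid synchronised retries
            sleep(base_delay * 2 ** attempt + random.uniform(0, base_delay))


def generate_with_checkpoint(items, generate_one, checkpoint_path, **backoff_kwargs):
    """Process items one by one and save after each, so a crash loses no finished work.

    Items must be strings (they become JSON keys) and results must be JSON-serialisable.
    """
    path = Path(checkpoint_path)
    done = json.loads(path.read_text()) if path.exists() else {}
    for key in items:
        if key in done:
            continue  # already generated in an earlier run
        done[key] = call_with_backoff(generate_one, key, **backoff_kwargs)
        path.write_text(json.dumps(done))
    return done
```

Rerunning generate_with_checkpoint with the same checkpoint path after a crash skips the items that were already saved and only pays for the remaining ones.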

Finally, and most importantly, more built-in evolutions would be a welcome addition to the ragas package, not to mention an easier way of creating custom evolutions.

Other Useful Features of RAGAs

  • Custom Prompts. The ragas package provides you with the option to change the prompts used in the provided abstractions. The example of custom prompts for metrics in the evaluation task is described in the docs. Below I use custom prompts for modifying evolutions to mitigate quality issues.
  • Automatic Language Adaptation. RAGAs has you covered for non-English languages. It has a great feature called automatic language adaptation, which supports RAG evaluation in languages other than English; see the docs for more info.


Despite the limitations of RAGAs, do NOT miss the most important thing:

RAGAs is already a very useful tool despite its young age. It enables the generation of synthetic evaluation sets for rigorous RAG evaluation, a critical aspect of successful RAG development.
