

RAG Hallucination Detection Techniques
Image by Editor | Midjourney

Introduction

Large language models (LLMs) are useful for many applications, including question answering, translation, summarization, and much more, with recent advancements in the area having increased their potential. As you are undoubtedly aware, there are times when LLMs provide factually incorrect answers, especially when the response desired for a given input prompt is not represented within the model’s training data. This leads to what we call hallucinations.

To mitigate the hallucination problem, retrieval-augmented generation (RAG) was developed. This technique retrieves data from a knowledge base that can help satisfy a user prompt's instructions. While RAG is a powerful technique, hallucinations can still manifest in its output. This is why detecting hallucinations, and formulating a plan to alert the user or otherwise handle them, is of the utmost importance in RAG systems.

Since the ability to trust a model's responses is the foremost concern with contemporary LLM systems, detecting and handling hallucinations has become more important than ever.

In a nutshell, RAG works by retrieving information from a knowledge base using various types of search, such as sparse or dense retrieval. The most relevant results are then passed into the LLM alongside the user prompt in order to generate the desired output. However, hallucination can still occur in the output for numerous reasons, including:

  • The LLM receives accurate information but fails to generate a correct response. This often happens when the output requires reasoning over the retrieved information.
  • The retrieved information is incorrect or does not contain relevant information. In this case, the LLM might still try to answer the question and hallucinate.

As our discussion centers on hallucinations, we will focus on detecting them in the responses generated by RAG systems, rather than on fixing the retrieval side. In this article, we will explore hallucination detection techniques that can help us build better RAG systems.

Hallucination Metrics

The first technique we will try is the hallucination metric from the DeepEval library. The hallucination metric is a simple, comparison-based approach to determining whether the model generates factually correct information. It is calculated by dividing the number of contexts contradicted by the output by the total number of contexts.

Let’s try it out with code examples. First, we need to install the DeepEval library.
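
For example, you can install it from PyPI with pip:

```
pip install deepeval
```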

The evaluation itself is performed by an LLM, so we will need a model to act as the evaluator. For this example, we will use the OpenAI model that DeepEval uses by default; you can check the DeepEval documentation to switch to another LLM. This means you will need to make your OpenAI API key available.
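
If you are following along, one way to do this is to set the OPENAI_API_KEY environment variable before running the evaluation (the key below is only a placeholder):

```python
import os

# Placeholder key: replace with your own OpenAI API key,
# or export OPENAI_API_KEY in your shell instead.
os.environ["OPENAI_API_KEY"] = "sk-..."
```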

With the library installed, we will try to detect hallucination in an LLM output. First, let's set up the context, that is, the facts the output should be consistent with. We will also create the actual output from the model, which is what we are testing.
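
Here is a minimal sketch; the context and output below are made-up values, chosen so that the output clearly contradicts the context:

```python
# Ground-truth facts the output should be consistent with.
context = [
    "The Great Wall of China is a series of fortifications built over centuries "
    "across the historical northern borders of China to protect against invasions."
]

# Simulated model output that contradicts the context above.
actual_output = (
    "The Great Wall of China was built in a single year and was designed "
    "primarily as a tourist attraction."
)
```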

Next, we will set up the test case and the hallucination metric. The threshold controls how much hallucination you are willing to tolerate; if you want strictly no hallucination, set it to zero.
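
Assuming the context and actual_output variables from the previous step, the setup looks something like this:

```python
from deepeval.test_case import LLMTestCase
from deepeval.metrics import HallucinationMetric

# Pair the user input and the generated answer with the ground-truth context.
test_case = LLMTestCase(
    input="Tell me about the Great Wall of China.",
    actual_output=actual_output,
    context=context,
)

# threshold is the maximum hallucination score the test will tolerate.
metric = HallucinationMetric(threshold=0.5)
```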

Let’s run the test and see the result.
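
Measuring the metric on the test case could look like this:

```python
# The score ranges from 0 (fully grounded in the context) to 1 (fully hallucinated).
metric.measure(test_case)
print(metric.score)
print(metric.reason)
```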

The hallucination metric shows a score of 1, which means the output is completely hallucinated. DeepEval also provides the reasoning behind the score.

G-Eval

G-Eval is a framework that uses an LLM with chain-of-thought (CoT) prompting to automatically evaluate LLM output against multi-step criteria that we define. We will use DeepEval's G-Eval implementation with our own criteria to test the RAG system's output and determine whether it is hallucinating.

With G-Eval, we need to set up the metric ourselves based on our criteria and evaluation steps. Here is how we set it up.
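
The metric definition below is an illustrative sketch: the name and evaluation steps are ones we make up for this example (alternatively, you can describe your criteria as a single string and let DeepEval derive the steps):

```python
from deepeval.metrics import GEval
from deepeval.test_case import LLMTestCaseParams

# Custom G-Eval metric that checks whether the generated answer stays
# faithful to the retrieved context and the expected output.
hallucination_geval = GEval(
    name="RAG Hallucination",
    evaluation_steps=[
        "Read the retrieval context and the expected output to establish the ground truth.",
        "Check every claim in the actual output against that ground truth.",
        "Heavily penalize claims that contradict, or are unsupported by, the retrieval context.",
    ],
    evaluation_params=[
        LLMTestCaseParams.INPUT,
        LLMTestCaseParams.ACTUAL_OUTPUT,
        LLMTestCaseParams.EXPECTED_OUTPUT,
        LLMTestCaseParams.RETRIEVAL_CONTEXT,
    ],
    threshold=0.5,
)
```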

Next, we will set up a test case to simulate the RAG process: the user input, the generated output, the expected output, and lastly the retrieval context, which is the information retrieved by the RAG system.
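
The values below are hypothetical, with the generated output deliberately containing fabricated details so there is a hallucination to detect:

```python
from deepeval.test_case import LLMTestCase

rag_test_case = LLMTestCase(
    input="When was the Eiffel Tower completed?",
    # Generated answer with a fabricated year and material.
    actual_output="The Eiffel Tower was completed in 1899 and is made entirely of copper.",
    # The answer we would expect given the retrieved information.
    expected_output="The Eiffel Tower was completed in 1889.",
    # Information pulled up by the retriever.
    retrieval_context=[
        "The Eiffel Tower is a wrought-iron lattice tower in Paris that was completed in 1889."
    ],
)
```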

Now, let’s use the G-Eval framework we have set up previously.
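
Running it on the test case might look like this:

```python
# Evaluate the simulated RAG test case with the custom G-Eval metric.
hallucination_geval.measure(rag_test_case)
print(hallucination_geval.score)
print(hallucination_geval.reason)
```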

Output:

With the G-Eval metric we set up, we can see that it detects hallucinations coming from the RAG system. The documentation provides further explanation of how the score is calculated.
 

Faithfulness Metric

If you want more quantitative metrics, we can try out DeepEval's RAG-specific metrics, which test whether the retrieval and generation steps are performing well. They include a metric for detecting hallucination called faithfulness.

There are five RAG-specific metrics available in DeepEval:

  1. Contextual precision evaluates the reranker
  2. Contextual recall evaluates whether the embedding model captures and retrieves relevant information accurately
  3. Contextual relevancy evaluates the text chunk size and the top-K setting
  4. Answer relevancy evaluates whether the prompt leads the LLM to generate a relevant answer
  5. Faithfulness evaluates whether the LLM generates output that does not hallucinate or contradict any information in the retrieval context

These differ from the hallucination metric discussed previously, as they focus specifically on the RAG process and its output. Let's try them out with the test case from the example above to see how they perform.
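
A sketch of how this could look, reusing the rag_test_case from the G-Eval example and DeepEval's evaluate helper with default settings:

```python
from deepeval import evaluate
from deepeval.metrics import (
    ContextualPrecisionMetric,
    ContextualRecallMetric,
    ContextualRelevancyMetric,
    AnswerRelevancyMetric,
    FaithfulnessMetric,
)

# The five RAG-specific metrics, each with its default threshold.
rag_metrics = [
    ContextualPrecisionMetric(),
    ContextualRecallMetric(),
    ContextualRelevancyMetric(),
    AnswerRelevancyMetric(),
    FaithfulnessMetric(),
]

# Run all metrics against the simulated RAG test case.
evaluate(test_cases=[rag_test_case], metrics=rag_metrics)
```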

Output:

The results show that the RAG pipeline is performing well except on the contextual relevancy and faithfulness metrics. In particular, the faithfulness metric, together with its reasoning, is able to detect the hallucination that occurs in the RAG output.

Summary

This article has explored different techniques for detecting hallucinations in RAG systems, focusing on three main approaches:

  • hallucination metrics using the DeepEval library
  • G-Eval framework with chain-of-thought methods
  • RAG-specific metrics including faithfulness evaluation

We have looked at some practical code examples for implementing each technique, demonstrating how they can measure and quantify hallucinations in LLM outputs, with a particular emphasis on comparing generated responses against known context or expected outputs.

Best of luck optimizing your RAG systems, and I hope this has helped.
