Using the RAG Triad for RAG evaluation

Retrieval-Augmented Generation (RAG) is a powerful way for LLMs to generate responses based on context beyond the scope of their training data, by supplying them with external data as additional context. This supporting context comes in the form of text chunks, which are usually parsed, vectorized, and indexed in vector databases for fast retrieval at inference time, hence the name retrieval-augmented generation.

In a previous guide, we explored how the generator in a RAG pipeline can hallucinate despite being supplied additional context, while the retriever can often fail to retrieve the correct and relevant context needed to generate the optimal answer. This is why evaluating RAG pipelines is important, and it is where the RAG triad comes into play.

What is the RAG Triad?

The RAG triad is composed of three RAG evaluation metrics: answer relevancy, faithfulness, and contextual relevancy. If a RAG pipeline scores high on all three metrics, we can confidently say that our RAG pipeline is using the optimal hyperparameters. This is because each metric in the RAG triad corresponds to a certain hyperparameter in the RAG pipeline. For instance:

Answer relevancy: the answer relevancy metric determines how relevant the answers generated by your RAG generator are. Since LLMs nowadays are getting pretty good at reasoning, it is mainly the prompt template hyperparameter, rather than the LLM, that you are iterating on when working with the answer relevancy metric. To be more specific, a low answer relevancy score signifies that you need to improve the examples used in your prompt templates for better in-context learning, or include more fine-grained prompting for better instruction following, so that the generator produces more relevant responses.
Faithfulness: the faithfulness metric determines how much of each answer generated by your RAG generator is hallucinated. This concerns the LLM hyperparameter, and you'll want to switch to a different LLM or even fine-tune your own if your LLM is unable to leverage the retrieval context supplied to it to generate grounded answers.

Info
You might also see the faithfulness metric called groundedness instead in other places. They are 100% the same thing but just named differently.
Contextual relevancy: the contextual relevancy metric determines whether the text chunks retrieved by your RAG retriever are relevant to producing the ideal answer for a user input. This concerns the chunk size, top-K, and embedding model hyperparameters. A good embedding model ensures you're able to retrieve text chunks that are semantically similar to the embedded user query, while a good combination of chunk size and top-K ensures you only select the most important bits of information in your knowledge base.
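
To make the metric-to-hyperparameter mapping above concrete, here is a minimal sketch in plain Python that does not depend on deepeval. The names TRIAD_HYPERPARAMETERS and hyperparameters_to_tune, the score dictionary keys, and the 0.5 cut-off are all assumptions made for illustration; they are not part of any library API.

```python
# Hypothetical helper: maps each RAG triad metric to the hyperparameters
# the guide associates with it. Names and threshold are illustrative only.
TRIAD_HYPERPARAMETERS = {
    "answer_relevancy": ["prompt template"],
    "faithfulness": ["LLM"],
    "contextual_relevancy": ["chunk size", "top-K", "embedding model"],
}


def hyperparameters_to_tune(scores, threshold=0.5):
    """Return the hyperparameters linked to every metric scoring below threshold.

    `scores` is a dict with one float per triad metric (assumed to be in [0, 1]).
    The 0.5 default is an assumed cut-off, not a recommendation.
    """
    to_tune = []
    for metric, hyperparameters in TRIAD_HYPERPARAMETERS.items():
        if scores[metric] < threshold:
            to_tune.extend(hyperparameters)
    return to_tune


# Example scores (made up for illustration, not real evaluation results).
example_scores = {
    "answer_relevancy": 0.9,
    "faithfulness": 0.4,
    "contextual_relevancy": 0.3,
}
print(hyperparameters_to_tune(example_scores))
```

In this made-up example the generator's answers are relevant but poorly grounded and the retrieved chunks are off-target, so the helper points at the LLM and the retriever settings rather than the prompt template.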

Using the RAG Triad in DeepEval

Using the RAG triad of metrics in deepeval is as simple as writing a few lines of code. First, create a test case to represent a user query, retrieved text chunks, and an LLM response:

from deepeval.test_case import LLMTestCase

test_case = LLMTestCase(input="...", actual_output="...", retrieval_context=["..."])

Here, input is the user query, actual_output is the LLM-generated response, and retrieval_context is a list of strings representing the retrieved text chunks. Then, define the RAG triad metrics:

from deepeval.metrics import AnswerRelevancyMetric, FaithfulnessMetric, ContextualRelevancyMetric

...
answer_relevancy = AnswerRelevancyMetric()
faithfulness = FaithfulnessMetric()
contextual_relevancy = ContextualRelevancyMetric()

Lastly, evaluate your test case using these metrics:

from deepeval import evaluate

...
evaluate(test_cases=[test_case], metrics=[answer_relevancy, faithfulness, contextual_relevancy])

Congratulations 🎉! You've learnt everything you need to know for the RAG triad.

Scaling RAG Evaluation

As you scale up your RAG evaluation efforts, you can simply supply more test cases to the test_cases list in the evaluate() function. More importantly, you can also generate synthetic datasets using deepeval to test your RAG application at scale.
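
As a rough sketch of the first option (supplying more test cases), the helper below runs a list of queries through a RAG pipeline and wraps each result in a test case. The names build_test_cases, run_rag, make_test_case, and fake_rag_pipeline are hypothetical and exist only for this illustration; the sketch assumes your pipeline returns an (answer, retrieved_chunks) pair for each query. The test-case constructor is passed in as an argument so the example runs without deepeval installed.

```python
def build_test_cases(queries, run_rag, make_test_case):
    """Build one test case per query.

    queries:        iterable of user query strings
    run_rag:        callable returning (answer, retrieved_chunks) for a query
                    (an assumed interface for your own pipeline)
    make_test_case: constructor accepting input, actual_output, and
                    retrieval_context keyword arguments
    """
    test_cases = []
    for query in queries:
        answer, chunks = run_rag(query)
        test_cases.append(
            make_test_case(
                input=query,
                actual_output=answer,
                retrieval_context=list(chunks),  # must be a list of strings
            )
        )
    return test_cases


def fake_rag_pipeline(query):
    """Stand-in pipeline for demonstration; replace with your real one."""
    return f"Placeholder answer to: {query}", ["placeholder chunk"]


# `dict` is used here as a stand-in constructor so the sketch runs anywhere.
demo_cases = build_test_cases(
    ["What is RAG?", "What is top-K?"], fake_rag_pipeline, dict
)
print(len(demo_cases))
```

With deepeval installed, you would pass LLMTestCase as make_test_case instead of dict, then hand the returned list to evaluate() together with the three RAG triad metrics defined earlier.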

FAQs

What is the RAG triad?

The RAG triad is a referenceless evaluation framework for RAG that combines three metrics: AnswerRelevancyMetric, FaithfulnessMetric, and ContextualRelevancyMetric. High scores across all three indicate that your RAG pipeline is using the right hyperparameters end-to-end.

Why is the RAG triad referenceless?

None of the three metrics requires expected_output. They score relevancy, faithfulness, and retrieval quality directly from the input, actual_output, and retrieval_context—so you can evaluate RAG even when you don't have a labelled ground truth answer.

What hyperparameter does each RAG triad metric target?

Answer relevancy targets the prompt template; faithfulness targets the generator LLM; contextual relevancy targets chunk size, top-K, and the embedding model. A low score on any single metric points you straight to the hyperparameter to tune.

Is faithfulness the same as groundedness?

Yes. Faithfulness and groundedness are two names for the same concept—how well the generated answer is supported by the retrieval_context, with no hallucinated claims.

How is contextual relevancy different from contextual precision?

Contextual relevancy is referenceless: it scores how relevant the retrieved chunks are to the input. Contextual precision and contextual recall are reference-based and require expected_output to measure ranking and coverage of the ideal answer's information.

Do I need labeled data to use the RAG triad?

No. The whole point of the RAG triad is that it's fully referenceless. You can evaluate RAG with just input, actual_output, and retrieval_context.

How do I scale RAG triad evaluation to many test cases?

Use DeepEval's Synthesizer to generate hundreds of Goldens from your knowledge base, then pass them to evaluate() with the RAG triad metrics.
