Hi,
The implementation for this article is available here.
RAGs are complex systems, and this becomes obvious as soon as you try to evaluate them: there are multiple aspects that need to be checked. Here, I look into different approaches to get a better understanding of the problems you face when evaluating RAG systems. RAG evaluation involves two distinct parts: retrieval and generation. For retrieval, context relevance and noise robustness are key factors in assessing quality, while for generation, answer faithfulness, answer relevance, negative rejection, information integration, and counterfactual robustness are important (Gao et al. 2024).
Retrieval aspects:
- Context relevance: Ensures that retrieved information is precise and directly related to the query.
- Noise robustness: Measures the system’s ability to ignore unrelated or irrelevant documents.
Generation aspects:
- Answer faithfulness: Ensures that the generated answers remain true to the retrieved context.
- Answer relevance: Assesses whether the generated response directly addresses the user query.
- Negative rejection: Evaluates the system’s ability to refrain from answering when insufficient reliable information is available.
- Information integration: Measures the system’s capability to synthesize knowledge from multiple sources.
- Counterfactual robustness: Ensures the identification and dismissal of incorrect or misleading information.
The evaluation can be done in multiple ways. Traditional methods use text overlap, semantic similarity, or a classifier. An LLM-as-a-judge can also be used for each of these aspects, and it is the core of the available evaluation frameworks. As a framework, I only look into DeepEval (I try to avoid the langchain universe). Information integration and counterfactual robustness are complex metrics that can only be evaluated with an LLM-as-a-judge. An LLM-as-a-judge needs careful prompting and calibration: the evaluation is often not binary, and the judge is itself a complex system that needs to be evaluated.
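To make this concrete, here is a minimal sketch of how a test case and judge-based metrics fit together in DeepEval (assuming DeepEval's current API; the example data, model, and thresholds are placeholders):

```python
# Minimal DeepEval sketch: one test case, two judge-based metrics.
# Assumes `pip install deepeval` and an API key for the judge model.
from deepeval import evaluate
from deepeval.metrics import AnswerRelevancyMetric, FaithfulnessMetric
from deepeval.test_case import LLMTestCase

test_case = LLMTestCase(
    input="What is the capital of France?",           # user query
    actual_output="The capital of France is Paris.",  # RAG answer
    retrieval_context=["Paris is the capital and largest city of France."],
)

# Both metrics use an LLM-as-a-judge under the hood.
evaluate(
    test_cases=[test_case],
    metrics=[AnswerRelevancyMetric(threshold=0.7), FaithfulnessMetric(threshold=0.7)],
)
```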
Resources:
- Retrieval-Augmented Generation for Large Language Models: A Survey (Gao et al. 2024)
- Evaluation of Retrieval-Augmented Generation: A Survey (Yu et al. 2024)
- A Survey on LLM-as-a-Judge (Gu et al. 2025)
- Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena (Zheng et al. 2023)
Frameworks:
- DeepEval
Context relevance
Context relevance measures whether the retrieved information contains the answer to the user query. It focuses on the stage before answer generation, which directly impacts the quality of the generated answer: irrelevant context can lead to hallucinations or incorrect answers. Relevance is not binary; a retrieved chunk can be relevant, partially relevant, tangentially related, or irrelevant.
There are two options to evaluate context relevance:
- similarity score with a threshold (sketched below)
- LLM-as-a-judge
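A minimal sketch of the similarity-score option, assuming sentence-transformers; the model name and threshold are placeholders, not recommendations:

```python
# Context relevance via embedding similarity: score each retrieved chunk
# against the query and flag chunks below a threshold as irrelevant.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")

def context_relevance(query: str, chunks: list[str], threshold: float = 0.5):
    query_emb = model.encode(query, convert_to_tensor=True)
    chunk_embs = model.encode(chunks, convert_to_tensor=True)
    scores = util.cos_sim(query_emb, chunk_embs)[0]  # one score per chunk
    return [
        {"chunk": c, "score": float(s), "relevant": float(s) >= threshold}
        for c, s in zip(chunks, scores)
    ]
```

Note that a single threshold flattens the relevant/partially relevant/irrelevant spectrum into a binary decision; bucketing the score range is a cheap way to keep some of that gradation.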
Noise robustness
Noise robustness measures the system’s ability to ignore unrelated or irrelevant documents, i.e. whether the generator can effectively ignore the noise. This is important, since retrieval systems are imperfect. Noise is not only off-topic information; it can also be outdated, contradictory, repetitive, redundant, or only slightly related information. A robust RAG system should be able to handle such irrelevant information, because poor noise robustness can lead to hallucinations in the generated answer.
The metrics are the same as before:
- semantic similarity (tricky, since the effect of noise is hard to measure this way)
- LLM-as-a-judge
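One practical building block is generating noisy test cases by injecting distractor documents into a clean retrieval context. A sketch, where `generate_answer` stands in for your own (hypothetical) RAG generator:

```python
# Noise-robustness test cases: pad a clean retrieval context with
# distractor documents and check whether the answer stays stable.
import random

DISTRACTORS = [
    "The Eiffel Tower was completed in 1889.",
    "Photosynthesis converts light energy into chemical energy.",
    "The 2008 financial crisis began in the US housing market.",
]

def with_noise(context: list[str], noise_ratio: float = 0.5, seed: int = 42):
    """Return the context interleaved with unrelated distractor documents."""
    rng = random.Random(seed)
    n_noise = max(1, int(len(context) * noise_ratio))
    noisy = context + rng.sample(DISTRACTORS, k=min(n_noise, len(DISTRACTORS)))
    rng.shuffle(noisy)
    return noisy

# Usage sketch (generate_answer is your RAG generator, not defined here):
# clean_answer = generate_answer(query, context)
# noisy_answer = generate_answer(query, with_noise(context))
# The two answers should agree; an LLM-as-a-judge can grade the comparison.
```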
Answer faithfulness
Answer faithfulness ensures that the generated answer remains true to the retrieved context: it measures whether the answer is grounded in the information present in the retrieved context. A faithful answer doesn’t contradict the context or introduce external facts that are missing from it. This is important, since the purpose of a RAG system is to answer based on the information in the knowledge base, not to hallucinate the answer. An answer can be relevant to the user query but still unfaithful if it makes claims not supported by the retrieved context. This is a common problem, since LLMs are trained on a lot of data and can easily mix in internal knowledge. Evaluation is hard because LLMs often paraphrase the context, so token overlap is not a good metric.
Faithfulness can be evaluated by:
- measuring the token overlap between answer and context via metrics like ROUGE and BLEU
- natural language inference models (sketched below; results depend on model biases and domain specificity)
- LLM-as-a-judge
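A sketch of the NLI option: treat the retrieved context as the premise and the answer as the hypothesis, then check for entailment. The model name is an assumption; any MNLI-style model works similarly:

```python
# Faithfulness via natural language inference: an answer entailed by the
# context is faithful; a contradiction signals an unfaithful answer.
from transformers import pipeline

nli = pipeline("text-classification", model="roberta-large-mnli")

def faithfulness(context: str, answer: str) -> dict:
    out = nli({"text": context, "text_pair": answer})
    result = out[0] if isinstance(out, list) else out
    # Labels for this model: ENTAILMENT, NEUTRAL, CONTRADICTION.
    return {"label": result["label"], "score": result["score"]}

print(faithfulness(
    context="Paris is the capital and largest city of France.",
    answer="The capital of France is Paris.",
))  # expected: ENTAILMENT with a high score
```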
Answer relevance
Answer relevance ensures that the generated answer directly addresses the user query: is the answer on-topic, and does it provide the information the user was actually asking for? Answer faithfulness, by contrast, ignores the original user question in its evaluation. Answer relevance differs from context relevance, since a relevant context can still lead to an irrelevant answer, and it differs from answer faithfulness, since an answer can be faithful to the context but not relevant to the user query. Answer relevance can be seen as the closest proxy for user satisfaction: if the answer is not relevant to the query, the system has failed, regardless of how relevant the context or how faithful the answer was.
Relevance can be evaluated by:
- semantic similarity
- LLM-as-a-judge (sketched below)
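A sketch of a simple relevance judge, using the OpenAI client as an example backend; the model name and the 1–5 rubric are assumptions that need calibration:

```python
# LLM-as-a-judge for answer relevance: grade the answer against the question.
from openai import OpenAI

client = OpenAI()  # expects OPENAI_API_KEY in the environment

JUDGE_PROMPT = """Rate how well the answer addresses the question.
Question: {question}
Answer: {answer}
Reply with a single integer from 1 (irrelevant) to 5 (fully relevant)."""

def judge_relevance(question: str, answer: str) -> int:
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        temperature=0,  # make the grading as deterministic as possible
        messages=[{
            "role": "user",
            "content": JUDGE_PROMPT.format(question=question, answer=answer),
        }],
    )
    return int(response.choices[0].message.content.strip())
```

A graded scale fits the observation that relevance is not binary; for a more stable judge, averaging several samples per test case helps.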
Negative rejection
Negative rejection evaluates the system’s ability to refrain from answering when insufficient reliable information is available. This means that the RAG system correctly identifies when it can’t answer a query based on the retrieved context, which can be caused by irrelevant or contradictory information, insufficient context, or the query being out of scope. Rejection is important, since it prevents the system from hallucinating or providing incorrect information. A correct rejection is a sign of robustness and reliability: a RAG system that rejects queries when it doesn’t have enough information is more trustworthy than one that tries to answer everything.
Negative rejection can be evaluated via:
- zero-shot classification (sketched below)
- LLM-as-a-judge
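A sketch of the zero-shot option: classify the generated answer as either an attempt or a refusal. The model name and labels are assumptions:

```python
# Detecting rejections with zero-shot classification.
from transformers import pipeline

classifier = pipeline("zero-shot-classification", model="facebook/bart-large-mnli")

LABELS = ["answers the question", "declines to answer"]

def is_rejection(answer: str) -> bool:
    result = classifier(answer, candidate_labels=LABELS)
    return result["labels"][0] == "declines to answer"  # top-scoring label

print(is_rejection("I don't have enough information to answer that."))  # True
print(is_rejection("The capital of France is Paris."))                  # False
```

With such a detector you can measure the rejection rate on deliberately unanswerable queries, and the false-rejection rate on answerable ones.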
Information integration
This metric becomes critical when the answer isn’t neatly contained in one retrieved document but requires combining multiple documents. Information integration measures the system’s capability to synthesize knowledge from multiple sources: how well the system merges the retrieved documents into a coherent, accurate, and contextually relevant answer. This includes synthesizing different facts, resolving conflicts, selecting relevant information, and summarizing with de-duplication, but also how contradicting information is handled.
Since this is a complex task, an LLM-as-a-judge is the most practical approach for evaluating the generation, although it needs a lot of manual calibration and testing.
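As a sketch of what such a judge could look like: a rubric prompt with binary sub-criteria is easier to calibrate than a single overall score. The wording here is my assumption and needs testing against human judgments; it can be sent through the same judge call as in the answer-relevance sketch above:

```python
# Rubric-style judge prompt for information integration. Binary sub-criteria
# make disagreements with human annotators easier to localize.
INTEGRATION_JUDGE_PROMPT = """You are grading a RAG answer for information
integration. The answer should synthesize ALL relevant sources below.

Sources:
{sources}

Question: {question}
Answer: {answer}

Grade each criterion with 0 or 1 and return them as JSON:
- "synthesis": facts from at least two sources are combined
- "conflicts": contradictions between sources are surfaced, not hidden
- "selection": irrelevant sources are ignored
- "dedup": repeated facts appear only once
"""
```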
Counterfactual robustness
Counterfactual robustness ensures the identification and dismissal of incorrect or misleading information. It evaluates how the LLM’s answer changes when counterfactual information is injected, and whether the model gets confused and produces nonsensical or unrelated hallucinations. Counterfactual robustness tests whether the answer blindly follows the provided context even when that context is deliberately wrong. It differs from faithfulness by checking against known incorrect context: a system can be “faithful” to counterfactual context, which is not the goal. A robust system should ideally recognize and highlight the counterfactual information if possible. Counterfactual test cases can be built by changing entities, altering attributes, flipping statements, or modifying relationships. It is a stress test of the system’s reliance on the retrieved context versus its internal knowledge.
Counterfactual robustness tests how much the generation relies on the retrieval part. This is another complex metric that needs an LLM-as-a-judge for evaluation, but the counterfactual test cases themselves can be constructed programmatically, as sketched below.
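A sketch of building counterfactual contexts by swapping known-true facts for false ones; the replacement map is a toy assumption, and real test sets need curated, verifiably false edits:

```python
# Build counterfactual contexts by replacing true facts with false ones.
SWAPS = {
    "Paris": "Lyon",             # change an entity
    "1889": "1925",              # alter an attribute
    "capital": "smallest town",  # modify a relationship
}

def make_counterfactual(context: str) -> str:
    """Return the context with known-true facts replaced by false ones."""
    for original, replacement in SWAPS.items():
        context = context.replace(original, replacement)
    return context

true_context = "Paris is the capital of France. The Eiffel Tower was completed in 1889."
print(make_counterfactual(true_context))
# -> "Lyon is the smallest town of France. The Eiffel Tower was completed in 1925."
```

Feeding both versions through the generator and letting a judge compare the answers shows whether the system repeats the falsified context or flags it.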
Conclusion
RAG systems are complicated: they need a lot of testing and a good evaluation dataset. The LLM-as-a-judge seems to me to be the best and most reliable approach, although it needs a lot of manual testing and calibration. The evaluation of RAG systems is an ongoing research area, and this is only a short summary of the different aspects. I hope this article gives you a better understanding of the different aspects and challenges of RAG system evaluation.
Thank you for your attention.