Can LLMs Be Trusted for Evaluating RAG Systems? A Survey of Methods and Datasets
Brehme, Lorenz, Ströhle, Thomas, Breu, Ruth
–arXiv.org Artificial Intelligence
Abstract--Retrieval-Augmented Generation (RAG) has advanced significantly in recent years. The complexity of RAG systems, which involve multiple components--such as indexing, retrieval, and generation--along with numerous other parameters, poses substantial challenges for systematic evaluation and quality enhancement. Previous research highlights that evaluating RAG systems is essential for documenting advancements, comparing configurations, and identifying effective approaches for domain-specific applications. This study systematically reviews 63 academic articles to provide a comprehensive overview of state-of-the-art RAG evaluation methodologies, focusing on four key areas: datasets, retrievers, indexing and databases, and the generator component. We observe the feasibility of an automated evaluation approach for each component of a RAG system, leveraging an LLM capable of both generating evaluation datasets and conducting evaluations. In addition, we found that further practical research is essential to provide companies with clear guidance on the do's and don'ts of implementing and evaluating RAG systems. By synthesizing evaluation approaches for key RAG components and emphasizing the creation and adaptation of domain-specific datasets for benchmarking, we contribute to the advancement of systematic evaluation methods and the improvement of evaluation rigor for RAG systems. Furthermore, by examining the interplay between automated approaches leveraging LLMs and human judgment, we contribute to the ongoing discourse on balancing automation and human input, clarifying their respective contributions, limitations, and challenges in achieving robust and reliable evaluations.

In recent years, Large Language Models (LLMs) have made significant progress in research and have grown increasingly popular [1].
However, LLMs face several challenges, including issues with hallucinations caused by insufficient context [2], as well as limitations in their learned content, which prevent them from addressing questions requiring specific or proprietary information [1].
Aug-8-2025