Towards Lighter and Robust Evaluation for Retrieval Augmented Generation

Ispas, Alex-Razvan, Simon, Charles-Elie, Caspani, Fabien, Guigue, Vincent

Mar-20-2025–arXiv.org Artificial Intelligence

Large Language Models are prompting us to view more NLP tasks from a generative perspective. At the same time, they offer a new way of accessing information, mainly through the RAG framework. While autoregressive models have improved notably, hallucination in the generated answers remains a persistent problem. A standard solution is to use commercial LLMs, such as GPT-4, to evaluate these algorithms. However, such frameworks are expensive and not very transparent. Therefore, we propose a study that demonstrates the value of open-weight models for evaluating RAG hallucination. We develop a lightweight approach using smaller, quantized LLMs to provide an accessible and interpretable metric that assigns continuous scores to generated answers with respect to their correctness and faithfulness. These scores allow us to question the reliability of decisions and to explore thresholds, leading to a new AUC metric as an alternative to correlation with human judgment.

Large Language Models (LLMs) have advanced the field of Natural Language Processing (NLP) in recent years (Achiam et al., 2023; Touvron et al., 2023; Jiang et al., 2024). However, some questions require information outside the knowledge scope of the model. Retrieval Augmented Generation (RAG) (Lewis et al., 2020) was therefore proposed to improve answer quality by retrieving information from a relevant knowledge base. The reliability of RAG remains a critical concern, particularly because of hallucinations in the generated answers. While much effort has been dedicated to improving model accuracy, a structured evaluation framework that explicitly addresses hallucination detection is still needed. In general, we want to assess the quality of an LLM answer by comparing it to a ground truth.
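The idea of sweeping thresholds over a continuous score and summarising the result as an AUC can be illustrated with a minimal sketch. It assumes the standard ROC AUC computed against binary human labels; the paper's exact metric definition is not given in this text, and the function names, scores, and labels below are hypothetical illustrations rather than the authors' implementation or data.

```python
from typing import List, Tuple


def roc_points(scores: List[float], labels: List[int]) -> List[Tuple[float, float]]:
    """Sweep a decision threshold over the scores and return (FPR, TPR) points.

    scores: continuous judge scores (higher means "more likely correct").
    labels: binary human judgments (1 = correct, 0 = incorrect).
    """
    positives = sum(labels)
    negatives = len(labels) - positives
    if positives == 0 or negatives == 0:
        raise ValueError("need at least one positive and one negative label")

    points = [(0.0, 0.0)]
    # Try each distinct score as a threshold: accept an answer when score >= threshold.
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 0)
        points.append((fp / negatives, tp / positives))
    return points


def auc(points: List[Tuple[float, float]]) -> float:
    """Area under the ROC curve by the trapezoidal rule."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area


# Hypothetical toy data: six answers scored by a judge model and labelled by a human.
judge_scores = [0.92, 0.81, 0.35, 0.60, 0.15, 0.70]
human_labels = [1, 1, 0, 1, 0, 0]

print(round(auc(roc_points(judge_scores, human_labels)), 3))  # 0.889
```

Unlike a single accuracy figure at one fixed cut-off, this summary does not depend on choosing a threshold in advance, which is one reason a threshold-free measure can be attractive when the judge produces continuous scores.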


