RAGalyst: Automated Human-Aligned Agentic Evaluation for Domain-Specific RAG

Gao, Joshua; Pham, Quoc Huy; Varghese, Subin; Saurav, Silwal; Hoskere, Vedhus

Nov-7-2025–arXiv.org Artificial Intelligence

Abstract: Retrieval-Augmented Generation (RAG) is a critical technique for grounding Large Language Models (LLMs) in factual evidence, yet evaluating RAG systems in specialized, safety-critical domains remains a significant challenge. Existing evaluation frameworks often rely on heuristic-based metrics that fail to capture domain-specific nuances, while other works use LLM-as-a-Judge approaches that lack validated alignment with human judgment. This paper introduces RAGalyst, an automated, human-aligned agentic framework designed for the rigorous evaluation of domain-specific RAG systems. RAGalyst features an agentic pipeline that generates high-quality, synthetic question-answering (QA) datasets from source documents, incorporating an agentic filtering step to ensure data fidelity. The framework refines two key LLM-as-a-Judge metrics (Answer Correctness and Answerability) using prompt optimization to achieve a strong correlation with human annotations. Applying this framework to evaluate various RAG components across three distinct domains (military operations, cybersecurity, and bridge engineering), we find that performance is highly context-dependent: no single embedding model, LLM, or hyperparameter configuration proves universally optimal. We also analyze the most common reasons for low Answer Correctness in RAG. These findings highlight the necessity of a systematic evaluation framework like RAGalyst, which empowers practitioners to uncover domain-specific trade-offs and make informed design choices for building reliable and effective RAG systems. RAGalyst is available on our GitHub.

Although modern Large Language Models (LLMs) are great synthesizers of information, they still suffer from hallucinations [1], [2]: the generation of content that appears plausible but is factually incorrect or unsupported by evidence. Mitigating hallucinations is especially important in safety-critical applications (e.g., military operations, cybersecurity, and bridge engineering), where inaccurate information can lead to serious consequences and undermine trust in artificial intelligence (AI) systems [3], [4]. Retrieval-Augmented Generation (RAG) has been widely adopted to mitigate hallucinations by grounding responses in provided context [5], [6]. A key advantage of RAG is its ability to provide models with dynamic, inference-time access to private and domain-relevant documents [5], [7].


