Face4RAG: Factual Consistency Evaluation for Retrieval Augmented Generation in Chinese

Yunqi Xu, Tianchi Cai, Jiyan Jiang, Xierui Song

arXiv.org Artificial Intelligence 

Retrieval Augmented Generation (RAG), which grounds generation in passages retrieved from external retrievers or search engines [27], has demonstrated strong performance on various knowledge-intensive tasks such as open-domain conversation [38, 41] and question answering [19]. Despite its bright prospect, factual consistency remains a critical issue for RAG systems. Recent assessment reveals that even for leading-edge commercial RAG systems like Bing Chat and Perplexity, barely over half of their outputs are factually consistent with the references [29]. This issue underscores the need to study factual consistency evaluation (FCE) in the RAG task. Various FCE methods have been proposed to evaluate the factual consistency of specific RAG systems, among which a two-step approach shows promising results, especially for evaluating long answers.

Despite the various FCE methods proposed earlier, these methods are evaluated on datasets generated by specific Large Language Models (LLMs). Without a comprehensive benchmark, it remains unexplored how these FCE methods perform on other LLMs with different error distributions or even unseen error types, as these methods may fail to detect the error types generated by other LLMs. To fill this gap, in this paper, we propose Face4RAG, the first comprehensive FCE benchmark for RAG that is independent of the underlying LLM. Our benchmark consists of a synthetic dataset built upon a carefully designed typology of factual inconsistency errors and a real-world dataset constructed from six commonly used LLMs, enabling evaluation of FCE methods on specific error types or real-world error distributions.
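The two-step approach mentioned above typically first decomposes a generated answer into atomic claims and then verifies each claim against the retrieved reference. Below is a minimal sketch of that pipeline; the naive sentence splitter and token-overlap verifier are illustrative placeholders (in practice the verification step would use an NLI model or an LLM judge), not the methods evaluated in the paper.

```python
# Minimal sketch of a two-step FCE pipeline:
# (1) decompose the answer into claims, (2) verify each claim.
import re

def decompose(answer: str) -> list[str]:
    """Step 1: split the answer into sentence-level claims (naive splitter)."""
    return [s.strip() for s in re.split(r"(?<=[.!?。！？])\s*", answer) if s.strip()]

def verify(claim: str, reference: str, threshold: float = 0.5) -> bool:
    """Step 2: judge whether the reference supports the claim.
    Token-overlap stand-in for an NLI model or LLM judge."""
    claim_tokens = set(re.findall(r"\w+", claim.lower()))
    ref_tokens = set(re.findall(r"\w+", reference.lower()))
    if not claim_tokens:
        return True
    return len(claim_tokens & ref_tokens) / len(claim_tokens) >= threshold

def factual_consistency(answer: str, reference: str) -> float:
    """Fraction of claims supported by the reference (1.0 = fully consistent)."""
    claims = decompose(answer)
    if not claims:
        return 1.0
    return sum(verify(c, reference) for c in claims) / len(claims)

if __name__ == "__main__":
    ref = "Mount Everest is 8,849 metres tall and lies on the Nepal-China border."
    ans = "Mount Everest is 8,849 metres tall. It is located in Japan."
    print(factual_consistency(ans, ref))  # 0.5: the second claim is unsupported
```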
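Because the synthetic split annotates examples with an error type from the typology, an FCE method can be scored per error type. The sketch below shows what such an evaluation loop might look like; the field names (`answer`, `reference`, `error_type`, `is_consistent`) and the `fce_method` callable are assumptions for illustration, not the benchmark's actual schema.

```python
# Hypothetical per-error-type accuracy of an FCE method on labeled
# benchmark examples; the example schema here is illustrative only.
from collections import defaultdict
from typing import Callable

def evaluate_by_error_type(
    examples: list[dict],
    fce_method: Callable[[str, str], bool],
) -> dict[str, float]:
    """Accuracy of an FCE method, grouped by annotated error type."""
    correct: dict[str, int] = defaultdict(int)
    total: dict[str, int] = defaultdict(int)
    for ex in examples:
        # Predicted consistency vs. the gold label for this example.
        pred = fce_method(ex["answer"], ex["reference"])
        etype = ex.get("error_type", "none")
        total[etype] += 1
        correct[etype] += int(pred == ex["is_consistent"])
    return {t: correct[t] / total[t] for t in total}
```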
