Bi'an: A Bilingual Benchmark and Model for Hallucination Detection in Retrieval-Augmented Generation
Jiang, Zhouyu, Sun, Mengshu, Zhang, Zhiqiang, Liang, Lei
–arXiv.org Artificial Intelligence
Retrieval-Augmented Generation (RAG) effectively reduces hallucinations in Large Language Models (LLMs), but it can still produce inconsistent or unsupported content. Although LLM-as-a-Judge is widely used for RAG hallucination detection because it is simple to implement, it faces two main challenges: the absence of comprehensive evaluation benchmarks and the lack of domain-optimized judge models. To bridge these gaps, we introduce Bi'an, a novel framework featuring a bilingual benchmark dataset and lightweight judge models. The dataset supports rigorous evaluation across multiple RAG scenarios, while the judge models are fine-tuned from compact open-source LLMs. Extensive experimental evaluations on Bi'anBench show that our 14B model outperforms baseline models with more than five times as many parameters and rivals state-of-the-art closed-source LLMs. We will release our data and models soon at https://github.com/OpenSPG/KAG.
Feb-26-2025
- Genre:
- Research Report (0.82)
- Technology: