MTRAG: A Multi-Turn Conversational Benchmark for Evaluating Retrieval-Augmented Generation Systems
Yannis Katsis, Sara Rosenthal, Kshitij Fadnis, Chulaka Gunasekara, Young-Suk Lee, Lucian Popa, Vraj Shah, Huaiyu Zhu, Danish Contractor, Marina Danilevsky
–arXiv.org Artificial Intelligence
Retrieval-augmented generation (RAG) has recently become a very popular task for Large Language Models (LLMs). Evaluating LLMs on multi-turn RAG conversations, where the system is asked to generate a response to a question in the context of a preceding conversation, is an important and often overlooked task that poses several additional challenges. We present MTRAG: an end-to-end human-generated multi-turn RAG benchmark that reflects several real-world properties across diverse dimensions for evaluating the full RAG pipeline. MTRAG contains 110 conversations averaging 7.7 turns each across four domains, for a total of 842 tasks. We also explore automation paths via synthetic data and LLM-as-a-Judge evaluation. Our human and automatic evaluations show that even state-of-the-art LLM RAG systems struggle on MTRAG. We demonstrate the need for strong retrieval and generation systems that can handle later turns, unanswerable questions, non-standalone questions, and multiple domains. MTRAG is available at https://github.com/ibm/mt-rag-benchmark.
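To make the multi-turn setup concrete, the sketch below shows how an evaluation loop over such a benchmark might be structured: each question is retrieved and answered in the context of the preceding conversation, so later, non-standalone turns depend on the accumulated history. This is a minimal illustration only; the class and function names (`Turn`, `Conversation`, `retrieve`, `generate`, `run_conversation`) are assumptions for the sketch, not the benchmark's actual schema or API.

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class Turn:
    """One user question plus the reference answer, if the question is answerable."""
    question: str
    reference_answer: Optional[str] = None  # None models an unanswerable turn

@dataclass
class Conversation:
    """A multi-turn conversation drawn from one of the benchmark's domains."""
    domain: str
    turns: List[Turn] = field(default_factory=list)

def retrieve(query: str, history: List[str], domain: str) -> List[str]:
    """Stub retriever: a real system would rewrite non-standalone queries
    using the conversation history before searching the domain's corpus."""
    return []

def generate(question: str, passages: List[str], history: List[str]) -> str:
    """Stub generator: a real system would condition an LLM on the
    retrieved passages and the preceding conversation."""
    return "I don't know." if not passages else passages[0]

def run_conversation(conv: Conversation) -> List[str]:
    """Run the full RAG pipeline turn by turn, accumulating history so that
    later, non-standalone questions can be resolved in context."""
    history: List[str] = []
    responses: List[str] = []
    for turn in conv.turns:
        passages = retrieve(turn.question, history, conv.domain)
        answer = generate(turn.question, passages, history)
        responses.append(answer)
        history.extend([turn.question, answer])
    return responses
```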
Jan-6-2025
- Country:
  - Asia (0.92)
  - Europe (0.67)
  - North America > United States (1.00)
- Genre:
  - Research Report > New Finding (0.46)
- Industry:
  - Government (0.94)
  - Leisure & Entertainment (1.00)
  - Media > Film (1.00)