MIRAGE-Bench: Automatic Multilingual Benchmark Arena for Retrieval-Augmented Generation Systems

Nandan Thakur, Suleman Kazi, Ge Luo, Jimmy Lin, Amin Ahmad

Oct-17-2024–arXiv.org Artificial Intelligence

Traditional Retrieval-Augmented Generation (RAG) benchmarks rely on different heuristic-based metrics for evaluation, but these require human preferences as ground truth for reference. In contrast, arena-based benchmarks, where two models compete against each other, require an expensive Large Language Model (LLM) as a judge for reliable evaluation. We present an easy and efficient technique to get the best of both worlds. The idea is to train a learning-to-rank model as a "surrogate" judge, using RAG-based evaluation heuristics as input, to produce a synthetic arena-based leaderboard. Using this idea, we develop MIRAGE-Bench, a standardized arena-based multilingual RAG benchmark for 18 diverse languages on Wikipedia. The benchmark is constructed using MIRACL, a retrieval dataset, and extended for multilingual generation evaluation. MIRAGE-Bench evaluates RAG extensively, coupling both heuristic features and an LLM-as-a-judge evaluator. In our work, we benchmark 19 diverse multilingual-focused LLMs and achieve a high correlation (Kendall Tau ($\tau$) = 0.909) between our surrogate judge, learned using heuristic features, and pairwise evaluations with GPT-4o as a teacher on the MIRAGE-Bench leaderboard using the Bradley-Terry framework. We observe that proprietary and large open-source LLMs currently dominate in multilingual RAG. MIRAGE-Bench is available at: https://github.com/vectara/mirage-bench.
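
The abstract names two standard tools: the Bradley-Terry framework, which turns pairwise preferences into a leaderboard, and Kendall Tau, which compares two leaderboards. The following is a minimal sketch of both, not the authors' implementation. The model names and win counts are invented toy data, the function names are hypothetical, and the learning-to-rank surrogate itself is not modeled here; the sketch only assumes that some judge (teacher or surrogate) has already produced pairwise win counts.

```python
from itertools import combinations


def fit_bradley_terry(wins, n_iter=200):
    """Estimate Bradley-Terry strengths with the classic iterative (MM) update.

    wins[(a, b)] is the number of times model a was preferred over model b.
    Assumes every model wins at least once, so no strength collapses to zero.
    """
    models = sorted({m for pair in wins for m in pair})
    strength = {m: 1.0 for m in models}
    for _ in range(n_iter):
        updated = {}
        for i in models:
            total_wins = sum(w for (a, _), w in wins.items() if a == i)
            denom = 0.0
            for j in models:
                if j == i:
                    continue
                n_ij = wins.get((i, j), 0) + wins.get((j, i), 0)
                if n_ij:
                    denom += n_ij / (strength[i] + strength[j])
            updated[i] = total_wins / denom if denom else strength[i]
        # Rescale so the strengths stay on a fixed scale across iterations.
        norm = sum(updated.values())
        strength = {m: s * len(models) / norm for m, s in updated.items()}
    return strength


def leaderboard(strength):
    """Return model names ordered from strongest to weakest."""
    return sorted(strength, key=strength.get, reverse=True)


def kendall_tau(rank_a, rank_b):
    """Kendall tau-a between two rankings of the same items (no ties)."""
    pos_a = {m: i for i, m in enumerate(rank_a)}
    pos_b = {m: i for i, m in enumerate(rank_b)}
    concordant = discordant = 0
    for x, y in combinations(rank_a, 2):
        sign = (pos_a[x] - pos_a[y]) * (pos_b[x] - pos_b[y])
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    n_pairs = len(rank_a) * (len(rank_a) - 1) // 2
    return (concordant - discordant) / n_pairs


# Toy pairwise win counts (invented for illustration, 10 comparisons per pair).
teacher_wins = {
    ("model_a", "model_b"): 7, ("model_b", "model_a"): 3,
    ("model_a", "model_c"): 8, ("model_c", "model_a"): 2,
    ("model_a", "model_d"): 9, ("model_d", "model_a"): 1,
    ("model_b", "model_c"): 6, ("model_c", "model_b"): 4,
    ("model_b", "model_d"): 7, ("model_d", "model_b"): 3,
    ("model_c", "model_d"): 6, ("model_d", "model_c"): 4,
}
surrogate_wins = {
    ("model_a", "model_b"): 6, ("model_b", "model_a"): 4,
    ("model_a", "model_c"): 7, ("model_c", "model_a"): 3,
    ("model_a", "model_d"): 8, ("model_d", "model_a"): 2,
    ("model_b", "model_c"): 4, ("model_c", "model_b"): 6,
    ("model_b", "model_d"): 6, ("model_d", "model_b"): 4,
    ("model_c", "model_d"): 7, ("model_d", "model_c"): 3,
}

teacher_rank = leaderboard(fit_bradley_terry(teacher_wins))
surrogate_rank = leaderboard(fit_bradley_terry(surrogate_wins))
tau = kendall_tau(teacher_rank, surrogate_rank)
print(teacher_rank, surrogate_rank, round(tau, 3))
```

In this toy setup the two judges disagree on one pair of models, so the correlation falls below 1. A value close to 1, such as the 0.909 reported in the abstract, would indicate that the surrogate leaderboard orders almost every pair of models the same way as the teacher leaderboard.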

Country:
- North America
  - Mexico (0.04)
  - United States
    - Pennsylvania (0.04)
    - Maryland (0.04)
    - District of Columbia > Washington (0.04)
    - New York > New York County
      - New York City (0.04)
    - Louisiana > Orleans Parish
      - New Orleans (0.04)
  - Canada
    - Quebec > Montreal (0.04)
    - British Columbia (0.04)
    - Ontario
      - Waterloo Region > Waterloo (0.04)
      - Toronto (0.04)
    - Nova Scotia > Halifax Regional Municipality
      - Halifax (0.04)
- Europe
  - Austria > Vienna (0.14)
  - United Kingdom (0.04)
  - Switzerland (0.04)
  - Germany (0.04)
  - Spain > Catalonia
    - Barcelona Province > Barcelona (0.04)
  - Italy
    - Tuscany > Florence (0.04)
    - Calabria > Catanzaro Province
      - Catanzaro (0.04)
  - Belgium > Brussels-Capital Region
    - Brussels (0.04)
- Asia
  - Singapore (0.04)
  - South Korea (0.04)
  - British Indian Ocean Territory > Diego Garcia (0.04)
  - Thailand > Bangkok
    - Bangkok (0.04)
  - Myanmar > Tanintharyi Region
    - Dawei (0.04)
  - Middle East
    - Jordan (0.04)
    - Saudi Arabia > Asir Province
      - Abha (0.04)
- Africa
  - Rwanda > Kigali
    - Kigali (0.04)
  - Ethiopia > Addis Ababa
    - Addis Ababa (0.04)

Genre:
- Research Report (1.00)

Technology:
- Information Technology > Artificial Intelligence
  - Natural Language
    - Large Language Model (1.00)
    - Chatbot (1.00)
  - Machine Learning > Neural Networks
    - Deep Learning (1.00)
