MM-Eval: A Multilingual Meta-Evaluation Benchmark for LLM-as-a-Judge and Reward Models
Guijin Son, Dongkeun Yoon, Juyoung Suk, Javier Aula-Blasco, Mano Aslan, Vu Trong Kim, Shayekh Bin Islam, Jaume Prats-Cristià, Lucía Tormo-Bañuelos, Seungone Kim
Large language models (LLMs) are commonly used as evaluators in tasks (e.g., reward modeling, LLM-as-a-judge), where they act as proxies for human preferences or judgments. This leads to the need for meta-evaluation: evaluating the credibility of LLMs as evaluators. However, existing benchmarks primarily focus on English, offering limited insight into LLMs' effectiveness as evaluators in non-English contexts. To address this, we introduce MM-Eval, a multilingual meta-evaluation benchmark that covers 18 languages across six categories. MM-Eval evaluates various dimensions, including language-specific challenges like linguistics and language hallucinations. Evaluation results show that both proprietary and open-source language models have considerable room for improvement. Further analysis reveals a tendency for these models to assign middle-ground scores to low-resource languages. We publicly release our benchmark and code.
arXiv.org Artificial Intelligence
Oct-23-2024
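To make the notion of meta-evaluation in the abstract concrete, the sketch below shows one common way a judge model is scored on pairwise preference data: its accuracy is the fraction of pairs in which it ranks the human-preferred response above the rejected one, broken down by language. This is a minimal illustration only; the `PairwiseExample` data class and `judge_score` callable are hypothetical, and MM-Eval's actual data format and scoring protocol are defined in the paper's released benchmark and code.

```python
# Minimal sketch of pairwise meta-evaluation for an LLM judge.
# The data layout and the `judge_score` function are hypothetical
# illustrations, not MM-Eval's actual interface.
from dataclasses import dataclass
from typing import Callable, Iterable

@dataclass
class PairwiseExample:
    prompt: str    # instruction shown to the model
    chosen: str    # response humans preferred
    rejected: str  # response humans did not prefer
    language: str  # e.g. "ko", "ca", "bn"

def meta_evaluate(
    examples: Iterable[PairwiseExample],
    judge_score: Callable[[str, str], float],  # (prompt, response) -> scalar score
) -> dict[str, float]:
    """Return per-language accuracy: how often the judge ranks the
    human-preferred response above the rejected one."""
    correct: dict[str, int] = {}
    total: dict[str, int] = {}
    for ex in examples:
        total[ex.language] = total.get(ex.language, 0) + 1
        if judge_score(ex.prompt, ex.chosen) > judge_score(ex.prompt, ex.rejected):
            correct[ex.language] = correct.get(ex.language, 0) + 1
    return {lang: correct.get(lang, 0) / n for lang, n in total.items()}
```

Reporting accuracy per language, as in this sketch, is what makes it possible to see effects like the one noted in the abstract, where judges drift toward middle-ground scores on low-resource languages.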