Deconstructing Self-Bias in LLM-generated Translation Benchmarks
Xu, Wenda, Agrawal, Sweta, Zouhar, Vilém, Freitag, Markus, Deutsch, Daniel
arXiv.org Artificial Intelligence
As large language models (LLMs) begin to saturate existing benchmarks, automated benchmark creation using LLMs (LLM-as-a-benchmark) has emerged as a scalable alternative to slow and costly human curation. While these generated test sets have the potential to rank models cheaply, we demonstrate a critical flaw. LLM-generated benchmarks systematically favor the model that created the benchmark: they exhibit self-bias on low-resource-language-to-English translation tasks. We show three key findings on automatic benchmarking of LLMs for translation. First, this bias originates from two sources: the generated test data (LLM-as-a-testset) and the evaluation method (LLM-as-an-evaluator), with their combination amplifying the effect. Second, self-bias in LLM-as-a-benchmark is heavily influenced by the model's generation capabilities in the source language. For instance, we observe more pronounced bias in into-English translation, where the model's generation system is developed, than in out-of-English translation tasks. Third, we observe that low diversity in the source text is one contributor to self-bias. Our results suggest that improving the diversity of these generated source texts can mitigate some of the observed self-bias.

The rapid advancement of large language models (LLMs) has led to an unprecedented saturation of existing, meticulously human-curated benchmarks. This phenomenon exposes two critical, intertwined challenges. First, traditional benchmark creation is too laborious and expensive to keep pace with rapid model development. Second, constructing high-quality benchmarks for low-resource languages is inherently difficult even with human labor, which further strains existing benchmark resources.
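To make the notion of self-bias concrete, the following is a minimal sketch of one way such an effect could be quantified. It is not the metric used in the paper: the function name `self_bias`, the score layout, and the toy numbers are all hypothetical. It assumes a table of scores in which each row is a benchmark generated by one model and each column is a model evaluated on that benchmark. After centering each benchmark to remove differences in difficulty, it compares a model's score on its own benchmark with its average score on benchmarks generated by the other models.

```python
from statistics import mean


def self_bias(scores):
    """Estimate a per-model self-bias gap (hypothetical measure).

    scores[author][model] is the quality score of `model` on the
    benchmark generated by `author`. Every model must appear in every row.
    """
    # Center each benchmark so easy and hard test sets are comparable.
    centered = {}
    for author, row in scores.items():
        row_mean = mean(row.values())
        centered[author] = {m: s - row_mean for m, s in row.items()}

    bias = {}
    for model in scores:
        own = centered[model][model]
        # Same model, scored on benchmarks authored by the other models.
        others = [centered[a][model] for a in centered if a != model]
        bias[model] = own - mean(others)
    return bias


# Toy, made-up scores for three hypothetical models A, B, and C.
toy_scores = {
    "A": {"A": 90, "B": 70, "C": 80},
    "B": {"A": 80, "B": 80, "C": 80},
    "C": {"A": 70, "B": 70, "C": 100},
}
bias = self_bias(toy_scores)
print(bias)
```

A positive value means the model scores better on its own benchmark than on benchmarks written by others. In this sketch the gap mixes both sources named above, the test set and the evaluator, so separating them would require holding one fixed while varying the other.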
Oct-1-2025