Deconstructing Self-Bias in LLM-generated Translation Benchmarks

Xu, Wenda, Agrawal, Sweta, Zouhar, Vilém, Freitag, Markus, Deutsch, Daniel

Oct-1-2025–arXiv.org Artificial Intelligence

As large language models (LLMs) begin to saturate existing benchmarks, automated benchmark creation using LLMs (LLM-as-a-benchmark) has emerged as a scalable alternative to slow and costly human curation. While these generated test sets have the potential to rank models cheaply, we demonstrate a critical flaw: LLM-generated benchmarks systematically favor the model that created the benchmark. Specifically, they exhibit self-bias on low-resource-language-to-English translation tasks. We report three key findings on automatic benchmarking of LLMs for translation. First, this bias originates from two sources, the generated test data (LLM-as-a-testset) and the evaluation method (LLM-as-an-evaluator), and their combination amplifies the effect. Second, self-bias in LLM-as-a-benchmark is heavily influenced by the model's generation capabilities in the source language. For instance, we observe more pronounced bias in into-English translation, where the model's generation system is developed, than in out-of-English translation tasks. Third, we observe that low diversity in the source text is one contributor to self-bias. Our results suggest that improving the diversity of these generated source texts can mitigate some of the observed self-bias.

The rapid advancement of large language models (LLMs) has led to an unprecedented saturation of existing, meticulously human-curated benchmarks. This phenomenon exposes two critical, intertwined challenges: traditional benchmark creation is too laborious and expensive to keep pace with rapid model development, and constructing high-quality benchmarks for low-resource languages is inherently difficult even with human labor, which further strains existing benchmark resources.

artificial intelligence, large language model, natural language
