Investigating Non-Transitivity in LLM-as-a-Judge

Xu, Yi, Ruis, Laura, Rocktäschel, Tim, Kirk, Robert

Mar-6-2025–arXiv.org Artificial Intelligence

Automatic evaluation methods based on large language models (LLMs) are emerging as the standard tool for assessing the instruction-following abilities of LLM-based agents. The most common method in this paradigm, pairwise comparisons with a baseline model, critically depends on the assumption of transitive preferences. However, the validity of this assumption remains largely unexplored. In this study, we investigate the presence of non-transitivity within the AlpacaEval framework and analyze its effects on model rankings. We find that LLM judges exhibit non-transitive preferences, leading to rankings that are sensitive to the choice of the baseline model. To mitigate this issue, we show that round-robin tournaments combined with Bradley-Terry models of preference can produce more reliable rankings. Notably, our method increases both the Spearman correlation and the Kendall correlation with Chatbot Arena (95.0% -> 96.4% and 82.1% -> 86.3% respectively). To address the computational cost of round-robin tournaments, we propose Swiss-Wise Iterative Matchmaking (Swim) tournaments, using a dynamic matching strategy to capture the benefits of round-robin tournaments while maintaining computational efficiency.

checklist, instruction, non-transitivity, (17 more...)

arXiv.org Artificial Intelligence

Mar-6-2025

arXiv.org PDF

Add feedback

Country:
- Asia
  - Middle East > Jordan (0.04)
  - Thailand > Bangkok
    - Bangkok (0.04)
- North America
  - Dominican Republic (0.04)
  - United States > Florida
    - Miami-Dade County > Miami (0.04)

Genre:
- Research Report > New Finding (1.00)

Industry:
- Leisure & Entertainment > Games (0.67)

Technology:
- Information Technology > Artificial Intelligence
  - Machine Learning > Neural Networks
    - Deep Learning (1.00)
  - Natural Language > Large Language Model (1.00)

Duplicate Docs Excel Report

Title
None found

Similar Docs Excel Report more

Title	Similarity	Source
None found