non-transitivity
Investigating Non-Transitivity in LLM-as-a-Judge
Xu, Yi, Ruis, Laura, Rocktäschel, Tim, Kirk, Robert
Automatic evaluation methods based on large language models (LLMs) are emerging as the standard tool for assessing the instruction-following abilities of LLM-based agents. The most common method in this paradigm, pairwise comparisons with a baseline model, critically depends on the assumption of transitive preferences. However, the validity of this assumption remains largely unexplored. In this study, we investigate the presence of non-transitivity within the AlpacaEval framework and analyze its effects on model rankings. We find that LLM judges exhibit non-transitive preferences, leading to rankings that are sensitive to the choice of the baseline model. To mitigate this issue, we show that round-robin tournaments combined with Bradley-Terry models of preference can produce more reliable rankings. Notably, our method increases both the Spearman correlation and the Kendall correlation with Chatbot Arena (95.0% -> 96.4% and 82.1% -> 86.3% respectively). To address the computational cost of round-robin tournaments, we propose Swiss-Wise Iterative Matchmaking (Swim) tournaments, using a dynamic matching strategy to capture the benefits of round-robin tournaments while maintaining computational efficiency.
- North America > United States > Florida > Miami-Dade County > Miami (0.04)
- North America > Dominican Republic (0.04)
- Asia > Thailand > Bangkok > Bangkok (0.04)
- Asia > Middle East > Jordan (0.04)
Measuring the Non-Transitivity in Chess
It has long been believed that Chess is the \emph{Drosophila} of Artificial Intelligence (AI). Studying Chess can productively provide valid knowledge about complex systems. Although remarkable progress has been made on solving Chess, the geometrical landscape of Chess in the strategy space is still mysterious. Judging on AI-generated strategies, researchers hypothesised that the strategy space of Chess possesses a spinning top geometry, with the upright axis representing the \emph{transitive} dimension (e.g., A beats B, B beats C, A beats C), and the radial axis representing the \emph{non-transitive} dimension (e.g., A beats B, B beats C, C beats A). However, it is unclear whether such a hypothesis holds for real-world strategies. In this paper, we quantify the non-transitivity in Chess through real-world data from human players. Specifically, we performed two ways of non-transitivity quantifications -- Nash Clustering and counting the number of Rock-Paper-Scissor cycles -- on over one billion match data from Lichess and FICS. Our findings positively indicate that the strategy space occupied by real-world Chess strategies demonstrates a spinning top geometry, and more importantly, there exists a strong connection between the degree of non-transitivity and the progression of a Chess player's rating. In particular, high degrees of non-transitivity tend to prevent human players from making progress on their Elo rating, whereas progressions are easier to make at the level of ratings where the degree of non-transitivity is lower. Additionally, we also investigate the implication of the degree of non-transitivity for population-based training methods. By considering \emph{fixed-memory Fictitious Play} as a proxy, we reach the conclusion that maintaining large-size and diverse populations of strategies is imperative to training effective AI agents in solving Chess types of games.
Measuring the Non-Transitivity in Chess
Sanjaya, Ricky, Wang, Jun, Yang, Yaodong
It has long been believed that Chess is the \emph{Drosophila} of Artificial Intelligence (AI). Studying Chess can productively provide valid knowledge about complex systems. Although remarkable progress has been made on solving Chess, the geometrical landscape of Chess in the strategy space is still mysterious. Judging on AI-generated strategies, researchers hypothesised that the strategy space of Chess possesses a spinning top geometry, with the upright axis representing the \emph{transitive} dimension (e.g., A beats B, B beats C, A beats C), and the radial axis representing the \emph{non-transitive} dimension (e.g., A beats B, B beats C, C beats A). However, it is unclear whether such a hypothesis holds for real-world strategies. In this paper, we quantify the non-transitivity in Chess through real-world data from human players. Specifically, we performed two ways of non-transitivity quantifications -- Nash Clustering and counting the number of Rock-Paper-Scissor cycles -- on over one billion match data from Lichess and FICS. Our findings positively indicate that the strategy space occupied by real-world Chess strategies demonstrates a spinning top geometry, and more importantly, there exists a strong connection between the degree of non-transitivity and the progression of a Chess player's rating. In particular, high degrees of non-transitivity tend to prevent human players from making progress on their Elo rating, whereas progressions are easier to make at the level of ratings where the degree of non-transitivity is lower. Additionally, we also investigate the implication of the degree of non-transitivity for population-based training methods. By considering \emph{fixed-memory Fictitious Play} as a proxy, we reach the conclusion that maintaining large-size and diverse populations of strategies is imperative to training effective AI agents in solving Chess types of games.