Deviation Ratings: A General, Clone-Invariant Rating Method

Luke Marris, Siqi Liu, Ian Gemp, Georgios Piliouras, Marc Lanctot

Feb-17-2025–arXiv.org Artificial Intelligence

Many real-world multi-agent or multi-task evaluation scenarios can be naturally modelled as normal-form games due to inherent strategic (adversarial, cooperative, and mixed-motive) interactions. These strategic interactions may be agentic (e.g. …). In such a formulation, it is the strategies (actions, policies, agents, models, tasks, prompts, etc.) that are rated. However, the rating problem is complicated by the redundancy and complexity of N-player strategic interactions. Repeated or similar strategies can distort ratings for those that counter or complement them. Previous work proposed "clone invariant" ratings to handle such redundancies, but this was limited to two-player zero-sum (i.e. …). This work introduces the first N-player general-sum clone-invariant rating, called deviation ratings, based on coarse correlated equilibria. The rating is explored on several domains, including LLM evaluation.

Data often captures relationships within a set (e.g., chess match outcomes) or between sets (e.g., film ratings by demographics). These sets can represent anything, including human players, machine learning models, tasks, or features. The interaction data, often scalar (win rates, scores, or other metrics), may be symmetric, asymmetric, or arbitrary. These interactions can be strategic, either in an agentic sense (e.g., players aiming to win) or due to inherent trade-offs (e.g., cost vs quality). This can lead to a game-theoretic interpretation: sets as players, elements as strategies, and interaction statistics as payoffs. This framing is common in analyzing strategic interactions between entities like Premier League teams, chess players (Sanjaya et al., 2022), reinforcement learning agents and tasks (Balduzzi et al., 2018), or even language models (Chiang et al., 2024). More generally, the idea of formulating real-world interactions as normal-form games, known as empirical game-theoretic analysis (Wellman, 2006), is well explored.
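The claim that repeated strategies can distort ratings can be illustrated with a small sketch. It is not the deviation rating method from the paper: it uses a plain uniform-average rating, chosen here only because it is the simplest rating that is not clone invariant. The rock-paper-scissors payoff matrix and the helper names `average_ratings` and `add_clone` are assumptions made for illustration.

```python
# Illustrative only: a uniform-average rating on a symmetric two-player
# zero-sum game, showing how a cloned strategy shifts the ratings of others.
# This is NOT the deviation rating; it is a deliberately naive baseline.


def average_ratings(payoffs):
    """Rate each row strategy by its mean payoff against all column strategies."""
    return [sum(row) / len(row) for row in payoffs]


def add_clone(payoffs, index):
    """Return a new payoff matrix in which strategy `index` appears twice."""
    # Every existing strategy gets the same payoff against the clone
    # as it gets against the original.
    rows = [list(row) + [row[index]] for row in payoffs]
    # The clone earns exactly what the original earns.
    rows.append(list(rows[index]))
    return rows


# Payoff to the row strategy; order is rock, paper, scissors (hypothetical data).
rps = [
    [0, -1, 1],
    [1, 0, -1],
    [-1, 1, 0],
]

before = average_ratings(rps)   # [0.0, 0.0, 0.0]
cloned = add_clone(rps, 0)      # duplicate "rock"
after = average_ratings(cloned) # [0.0, 0.25, -0.25, 0.0]

print(before)
print(after)
```

Adding a second copy of rock raises the rating of paper (which counters rock) and lowers the rating of scissors, although no new strategic information has been added. A clone-invariant rating would leave the ratings of rock, paper, and scissors unchanged under this duplication.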
