Evaluating Language Model Agency through Negotiations
Davidson, Tim R., Veselovsky, Veniamin, Josifoski, Martin, Peyrard, Maxime, Bosselut, Antoine, Kosinski, Michal, West, Robert
Companies, organizations, and governments increasingly exploit Language Models' (LM) remarkable capability to display agent-like behavior. As LMs are adopted to perform tasks with growing autonomy, there exists an urgent need for reliable and scalable evaluation benchmarks. Current, predominantly static LM benchmarks are ill-suited to evaluate such dynamic applications. Thus, we propose jointly evaluating LM performance and alignment through the lenses of negotiation games. We argue that this common task better reflects real-world deployment conditions while offering insights into LMs' decision-making processes. Crucially, negotiation games allow us to study multi-turn and cross-model interactions, modulate complexity, and side-step accidental data leakage in evaluation. We report results for six publicly accessible LMs from several major providers on a variety of negotiation games, evaluating both self-play and cross-play performance. Noteworthy findings include: (i) open-source models are currently unable to complete these tasks; (ii) cooperative bargaining games prove challenging; and (iii) the most powerful models do not always "win".
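To make the evaluation setup concrete, the sketch below shows a minimal two-agent negotiation loop of the kind the abstract describes (multi-turn, self-play or cross-play). It is an illustrative assumption, not the paper's protocol: the `Agent` wrapper, the `respond` callback standing in for an LM call, and the "ACCEPT" stopping rule are all hypothetical placeholders.

```python
from dataclasses import dataclass
from typing import Callable, List

# Hypothetical two-agent negotiation loop (not the authors' code).
# `respond` stands in for any chat-LM call that maps the transcript so far
# to the agent's next message; swapping in two different models gives
# cross-play, the same model twice gives self-play.

@dataclass
class Agent:
    name: str
    respond: Callable[[List[str]], str]

def negotiate(a: Agent, b: Agent, max_turns: int = 10) -> List[str]:
    """Alternate messages between two agents until one accepts or turns run out."""
    transcript: List[str] = []
    speakers = [a, b]
    for turn in range(max_turns):
        speaker = speakers[turn % 2]
        message = speaker.respond(transcript)
        transcript.append(f"{speaker.name}: {message}")
        if "ACCEPT" in message.upper():  # toy stopping rule for the sketch
            break
    return transcript

if __name__ == "__main__":
    # Scripted stand-ins for LMs: the buyer concedes and accepts on its second turn.
    buyer = Agent("buyer", lambda t: "I offer 40." if len(t) < 2 else "ACCEPT at 45.")
    seller = Agent("seller", lambda t: "I ask 50, but will take 45.")
    for line in negotiate(buyer, seller):
        print(line)
```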
arXiv.org Artificial Intelligence
Jan-9-2024