ZeroSumEval: An Extensible Framework For Scaling LLM Evaluation with Inter-Model Competition

Hisham A. Alyahya, Haidar Khan, Yazeed Alnumay, M Saiful Bari, Bülent Yener

arXiv.org Artificial Intelligence 

We introduce ZeroSumEval, a dynamic, competition-based, and evolving evaluation framework for Large Language Models (LLMs) that leverages competitive games. ZeroSumEval encompasses a diverse suite of games, including security challenges (Capture the Flag), classic board games (chess), and knowledge tests (MathQuiz). These games are designed to evaluate a range of capabilities such as strategic reasoning, planning, knowledge application, safety, and adaptability. Building upon recent studies that highlight the effectiveness of game-based evaluations for LLMs, ZeroSumEval enhances these approaches by providing a standardized, extensible framework for easily implementing games and by leveraging DSPy to provide a better abstraction for LLM player strategies.
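To make the abstract's design concrete, the following is a minimal sketch of how an extensible game interface and a DSPy-backed player strategy might be expressed. The names here (GameState, ProposeMove, Player, render, apply, winner) are illustrative assumptions, not the actual ZeroSumEval API; only the DSPy constructs (dspy.Signature, dspy.InputField, dspy.OutputField, dspy.Module, dspy.ChainOfThought) are real library features.

```python
# Hypothetical sketch: a pluggable game interface plus a player strategy
# written as a DSPy module. All class/method names are assumptions made for
# illustration; they are not ZeroSumEval's actual classes.
from abc import ABC, abstractmethod

import dspy


class GameState(ABC):
    """Minimal interface a game must implement to plug into the framework."""

    @abstractmethod
    def render(self) -> str:
        """Return a textual view of the current state for the acting player."""

    @abstractmethod
    def apply(self, move: str) -> "GameState":
        """Apply a player's move and return the resulting state."""

    @abstractmethod
    def winner(self) -> str | None:
        """Return the winning player's id, or None while the game is running."""


class ProposeMove(dspy.Signature):
    """Given the current game state, propose a single legal move."""

    state = dspy.InputField(desc="textual description of the current game state")
    move = dspy.OutputField(desc="a single legal move in the game's notation")


class Player(dspy.Module):
    """A player strategy expressed as a DSPy module, so its prompting
    can be abstracted and optimized rather than hand-written per game."""

    def __init__(self):
        super().__init__()
        self.propose = dspy.ChainOfThought(ProposeMove)

    def forward(self, state: GameState) -> str:
        # The DSPy program receives the rendered state and returns a move string.
        return self.propose(state=state.render()).move
```

Under this sketch, implementing a new game reduces to subclassing the state interface, while player strategies stay game-agnostic DSPy programs, which is the kind of separation the abstract attributes to the framework.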