MixEval: Deriving Wisdom of the Crowd from LLM Benchmark Mixtures

Mar-22-2026, 03:02:01 GMT – Neural Information Processing Systems

Evaluating large language models (LLMs) is challenging. Traditional ground-truth-based benchmarks fail to capture the comprehensiveness and nuance of real-world queries, while LLM-as-judge benchmarks suffer from grading biases and a limited number of queries. Both kinds of benchmark may also become contaminated over time. User-facing evaluation, such as Chatbot Arena, provides reliable signals but is costly and slow. In this work, we propose MixEval, a new paradigm for establishing efficient, gold-standard LLM evaluation by strategically mixing off-the-shelf benchmarks.

artificial intelligence, large language model, natural language

Technology:
- Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)