JudgeBench: A Benchmark for Evaluating LLM-based Judges
Tan, Sijun, Zhuang, Siyuan, Montgomery, Kyle, Tang, William Y., Cuadron, Alejandro, Wang, Chenguang, Popa, Raluca Ada, Stoica, Ion
arXiv.org Artificial Intelligence
LLM-based judges have emerged as a scalable alternative to human evaluation and are increasingly used to assess, compare, and improve models. However, the reliability of LLM-based judges themselves is rarely scrutinized. As LLMs become more advanced, their responses grow more sophisticated, requiring stronger judges to evaluate them. Existing benchmarks primarily focus on a judge's alignment with human preferences, but often fail to account for more challenging tasks where crowdsourced human preference is a poor indicator of factual and logical correctness. To address this, we propose a novel evaluation framework to objectively evaluate LLM-based judges. Based on this framework, we propose JudgeBench, a benchmark for evaluating LLM-based judges on challenging response pairs spanning knowledge, reasoning, math, and coding. Our comprehensive evaluation on a collection of prompted judges, fine-tuned judges, multi-agent judges, and reward models shows that JudgeBench poses a significantly greater challenge than previous benchmarks, with many strong models (e.g., GPT-4o) performing just slightly better than random guessing. Overall, JudgeBench offers a reliable platform for assessing increasingly advanced LLM-based judges. Data and code are available at https://github.com/
Large Language Models (LLMs) have demonstrated remarkable success in recent years and are still evolving at a rapid pace. With more advanced AI models coming out every month, a central challenge is how to evaluate, compare, and supervise these models. While human judgments have traditionally been the gold standard in evaluating and supervising language models, collecting human judgments is often costly and time-consuming. As an alternative, using LLM-based judges (Zheng et al., 2024) has become a scalable paradigm in addressing this limitation, and has been increasingly adopted to evaluate and rank models.
Moreover, these LLM-based judges are now integral to enhancing models' capabilities, serving as reward models during training (Yuan et al., 2024; Luo et al., 2024a) and acting as verifiers during inference to select the best response from multiple candidates (Cobbe et al., 2021; Lightman et al., 2023). Despite this widespread adoption, a fundamental question remains: how reliable are these LLM-based judges themselves? Since LLMs are prone to making logical and factual mistakes, how can we trust that LLM-based judges are accurate and objective? To evaluate LLM-based judges, many prior works have focused on the judges' agreement with human preference (Dubois et al., 2024; Zheng et al., 2024; Zhang et al., 2023; Wang et al., 2023a). The core assumption implied in these works is that crowdsourced human annotators will evaluate the responses objectively and not make mistakes.
[Figure: example instruction and response] Prompt: Rewrite the sentence using gender-neutral language: A salesman is giving a presentation. A salesperson is giving a presentation.
Oct-16-2024