An Empirical Analysis of Uncertainty in Large Language Model Evaluations
Qiujie Xie, Qingqiu Li, Zhuohao Yu, Yuejie Zhang, Yue Zhang, Linyi Yang
As LLM-as-a-Judge emerges as a new paradigm for assessing large language models (LLMs), concerns have been raised regarding the alignment, bias, and stability of LLM evaluators. While substantial work has focused on alignment and bias, little research has concentrated on the stability of LLM evaluators. In this paper, we conduct extensive experiments involving 9 widely used LLM evaluators across 2 different evaluation settings to investigate the uncertainty in model-based LLM evaluations. We pinpoint that LLM evaluators exhibit varying uncertainty based on model families and sizes. Through careful comparative analyses, we find that employing special prompting strategies, whether during inference or post-training, can alleviate evaluation uncertainty to some extent. By utilizing uncertainty to enhance the evaluator's reliability and detection capability on Out-Of-Distribution (OOD) data, we further fine-tune an uncertainty-aware LLM evaluator named ConfiLM using a human-annotated fine-tuning set and assess ConfiLM's OOD evaluation ability on a manually designed test set sourced from the 2024 Olympics. Experimental results demonstrate that incorporating uncertainty as additional information during the fine-tuning phase can largely improve the model's evaluation performance in OOD scenarios.

Figure (evaluation example): Prompt: "Describe a unique trait of the raccoon." Candidate A's response: "A unique trait of a raccoon is its ability to open and close its eyes while they are closed." Evaluator verdict: "Candidate A's response is better." The evaluation process is influenced by the uncertainty of both the evaluator and the candidate model.

Large language models (LLMs) have garnered increasing attention due to their unprecedented performance in various real-world applications (Zhao et al., 2023; Wang et al., 2024a). In this context, how to accurately assess the performance of an LLM becomes particularly important. This area of research includes benchmark-based evaluation, model-based evaluation, and human evaluation (Chang et al., 2024). While various benchmarks (Zellers et al., 2019; Hendrycks et al., 2021; Yang et al., 2023; Xie et al., 2024) have been proposed to measure the core abilities of LLMs in comprehension and generation, human evaluation remains the gold standard for testing overall performance due to its complexity and open-endedness. However, this approach is limited by subjectivity issues (Krishna et al., 2023) and resource costs (Karpinska et al., 2021).

Figure 2: We conduct extensive experiments and analysis to investigate the existence, mitigation, and utilization of uncertainty in model-based LLM evaluation. (Panels: Experiment 1: Does uncertainty exist in LLM evaluators? Experiment 2: How can uncertainty be mitigated? Experiment 3: Can we utilize uncertainty?)

Uncertainty plays a key role in the evaluation process and can be leveraged to enhance the evaluator's performance in OOD scenarios. As LLM-as-a-Judge gains more attention, criticism has also emerged (Thakur et al., 2024).
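The excerpt does not spell out how evaluator uncertainty is quantified. As a minimal, hedged sketch (not the paper's ConfiLM procedure), one simple proxy is to sample an LLM judge's pairwise verdict several times at non-zero temperature and measure the entropy of the resulting verdict distribution; `judge_fn`, `noisy_judge`, and the other names below are hypothetical stand-ins, not identifiers defined in the paper.

```python
# Illustrative sketch: estimate an LLM evaluator's uncertainty by repeatedly
# sampling its pairwise verdict and computing the entropy of the verdicts.
# Higher entropy = the judgment flips across samples = less stable evaluator.

import math
from collections import Counter
from typing import Callable, List


def verdict_entropy(verdicts: List[str]) -> float:
    """Shannon entropy (in bits) of the empirical verdict distribution."""
    counts = Counter(verdicts)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def evaluator_uncertainty(
    judge_fn: Callable[[str, str, str], str],  # hypothetical wrapper around an LLM judge
    question: str,
    answer_a: str,
    answer_b: str,
    n_samples: int = 10,
) -> float:
    """Query the evaluator n_samples times and return the entropy of its verdicts.

    0.0 means the evaluator is perfectly consistent; larger values indicate
    higher evaluation uncertainty for this (question, answer pair) instance.
    """
    verdicts = [judge_fn(question, answer_a, answer_b) for _ in range(n_samples)]
    return verdict_entropy(verdicts)


if __name__ == "__main__":
    import random

    # Toy stand-in for a real LLM judge: returns "A" 70% of the time, "B" otherwise.
    def noisy_judge(question: str, answer_a: str, answer_b: str) -> str:
        return "A" if random.random() < 0.7 else "B"

    u = evaluator_uncertainty(
        noisy_judge,
        "Describe a unique trait of the raccoon",
        "Raccoons have highly dexterous front paws.",
        "A raccoon can open and close its eyes while they are closed.",
    )
    print(f"estimated verdict entropy: {u:.3f} bits")
```

In practice, the same per-instance uncertainty score could be attached to each example as auxiliary input when fine-tuning an uncertainty-aware evaluator, which is the general idea the abstract describes for ConfiLM; the exact training recipe is detailed in the paper itself.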
arXiv.org Artificial Intelligence
Feb-15-2025
- Country:
- Europe (1.00)
- Genre:
- Research Report > New Finding (1.00)
- Industry:
- Leisure & Entertainment > Sports > Olympic Games (1.00)
- Technology: