Validating LLM-as-a-Judge Systems under Rating Indeterminacy

Jun-21-2026, 02:01:30 GMT–Neural Information Processing Systems

The LLM-as-a-judge paradigm, in which a judge LLM system replaces human raters in rating the outputs of other generative AI (GenAI) systems, plays a critical role in scaling and standardizing GenAI evaluations. To validate such judge systems, evaluators assess human-judge agreement by first collecting multiple human ratings for each item in a validation corpus, then aggregating the ratings into a single, per-item gold label rating. For many items, however, rating criteria may admit multiple valid interpretations, so a human or LLM rater may deem multiple ratings "reasonable" or "correct". We call this condition rating indeterminacy. Problematically, many rating tasks that contain rating indeterminacy rely on forced-choice elicitation, whereby raters are instructed to select only one rating for each item.
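The validation procedure described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' method: the abstract does not say how ratings are aggregated, so majority vote with an arbitrary tie-break is assumed here, and the names `gold_label` and `agreement_rate` and the toy ratings are hypothetical.

```python
from collections import Counter


def gold_label(ratings):
    """Collapse one item's human ratings into a single label.

    Assumption: majority vote, with ties broken by taking the
    smallest label. The tie-break is arbitrary and illustrative.
    """
    counts = Counter(ratings)
    top = max(counts.values())
    return min(label for label, count in counts.items() if count == top)


def agreement_rate(human_ratings, judge_ratings):
    """Fraction of items where the judge's rating equals the gold label."""
    gold = [gold_label(item) for item in human_ratings]
    matches = sum(g == j for g, j in zip(gold, judge_ratings))
    return matches / len(gold)


# Toy validation corpus (hypothetical): forced-choice ratings per item.
human_ratings = [
    ["yes", "yes", "no"],
    ["no", "no", "no"],
    ["yes", "no"],  # raters split: both ratings may be reasonable
]
judge_ratings = ["yes", "no", "yes"]

print(agreement_rate(human_ratings, judge_ratings))
```

On the third item the human raters are split, yet the aggregation step still emits a single gold label, so the judge's "yes" is scored as a disagreement even though a human rater gave the same rating. That is the kind of distortion rating indeterminacy can introduce under forced-choice elicitation.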

large language model, machine learning, natural language

Genre:
- Overview (1.00)
- Research Report
  - Experimental Study (1.00)
  - New Finding (0.92)

Industry:
- Banking & Finance (0.46)
- Government (0.45)

Technology:
- Information Technology > Artificial Intelligence
  - Natural Language > Large Language Model (1.00)
  - Machine Learning
    - Performance Analysis > Accuracy (0.92)
    - Neural Networks > Deep Learning
      - Generative AI (0.34)
