Validating LLM-as-a-Judge Systems under Rating Indeterminacy
Neural Information Processing Systems
The LLM-as-a-judge paradigm, in which a judge LLM system replaces human raters in rating the outputs of other generative AI (GenAI) systems, plays a critical role in scaling and standardizing GenAI evaluations. To validate such judge systems, evaluators assess human-judge agreement by first collecting multiple human ratings for each item in a validation corpus, then aggregating the ratings into a single, per-item gold label rating. For many items, however, rating criteria may admit multiple valid interpretations, so a human or LLM rater may deem multiple ratings "reasonable" or "correct". We call this condition rating indeterminacy. Problematically, many rating tasks that contain rating indeterminacy rely on forced-choice elicitation, whereby raters are instructed to select only one rating for each item.
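The validation procedure described above (collect several human ratings per item, aggregate them into one gold label, then measure human-judge agreement) can be sketched as follows. This is a minimal illustration, not the authors' implementation: majority vote as the aggregation rule, the tie-breaking rule, the function names `majority_vote` and `agreement_rate`, and the toy ratings are all assumptions introduced here.

```python
from collections import Counter


def majority_vote(ratings):
    """Aggregate one item's forced-choice ratings into a single gold label.

    Assumption: ties are broken by taking the smallest label. This is an
    arbitrary choice made only to keep the sketch deterministic.
    """
    counts = Counter(ratings)
    top = max(counts.values())
    return min(label for label, count in counts.items() if count == top)


def agreement_rate(human_ratings, judge_ratings):
    """Fraction of items where the judge's rating equals the gold label."""
    gold = [majority_vote(item) for item in human_ratings]
    matches = sum(g == j for g, j in zip(gold, judge_ratings))
    return matches / len(gold)


# Hypothetical data: four items, three human raters each, binary ratings.
human_ratings = [[1, 1, 0], [0, 1, 0], [1, 0, 1], [0, 0, 0]]
judge_ratings = [1, 1, 1, 0]  # one forced-choice rating per item from the judge

gold_labels = [majority_vote(item) for item in human_ratings]
rate = agreement_rate(human_ratings, judge_ratings)
print(gold_labels, rate)
```

In this toy data the human raters split on three of the four items, yet the aggregated gold label records only one rating per item. That loss of information is the kind of situation the abstract flags when criteria admit more than one reasonable rating.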
Jun-21-2026, 02:01:30 GMT
- Genre:
- Overview (1.00)
- Research Report
- Experimental Study (1.00)
- New Finding (0.92)
- Industry:
- Banking & Finance (0.46)
- Government (0.45)
- Technology: