Rating Roulette: Self-Inconsistency in LLM-As-A-Judge Frameworks
Rajarshi Haldar, Julia Hockenmaier
arXiv.org Artificial Intelligence
As Natural Language Generation (NLG) systems are increasingly deployed, evaluating their outputs reliably has become difficult. Recently, using large language models (LLMs) as judges has gained traction, since their assessments tend to align more closely with human preferences than conventional n-gram or embedding-based metrics. In our experiments, we show that LLM judges exhibit low intra-rater reliability: the scores they assign vary across runs. This variance makes their ratings inconsistent, in the worst case almost arbitrary, and makes it difficult to measure how good their judgments actually are. We quantify this inconsistency across several NLG tasks and benchmarks and examine whether, with appropriate guidelines, LLM judges can still be used productively.
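The abstract's notion of intra-rater reliability (run-to-run score consistency) can be sketched with a simple measure of score spread. The code below is a minimal illustration, not the paper's method: the judge scores are made-up placeholder values, and in practice they would come from repeated calls to the same LLM judge on the same generation samples.

```python
# Minimal sketch: quantifying run-to-run (intra-rater) consistency of an
# LLM judge. The scores here are hypothetical placeholders.
from statistics import mean, pstdev

# rows: generation samples; columns: scores from 5 independent judge runs
judge_scores = [
    [4, 5, 3, 4, 5],
    [2, 2, 3, 2, 2],
    [5, 3, 4, 5, 3],
]

def per_item_spread(scores):
    """Population standard deviation of each item's score across runs."""
    return [pstdev(run_scores) for run_scores in scores]

def mean_spread(scores):
    """Average run-to-run spread; 0 means perfectly consistent ratings."""
    return mean(per_item_spread(scores))
```

A perfectly reliable judge would yield a mean spread of 0; the larger the value relative to the rating scale, the closer the ratings are to arbitrary.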
Nov-3-2025