LLM-Rubric: A Multidimensional, Calibrated Approach to Automated Evaluation of Natural Language Texts
Helia Hashemi, Jason Eisner, Corby Rosset, Benjamin Van Durme, Chris Kedzie
This paper introduces a framework for the automated evaluation of natural language texts. A manually constructed rubric describes how to assess multiple dimensions of interest. To evaluate a text, a large language model (LLM) is prompted with each rubric question and produces a distribution over potential responses. The LLM predictions often fail to agree well with human judges -- indeed, the humans do not fully agree with one another. However, the multiple LLM distributions can be $\textit{combined}$ to $\textit{predict}$ each human judge's annotations on all questions, including a summary question that assesses overall quality or relevance. LLM-Rubric accomplishes this by training a small feed-forward neural network that includes both judge-specific and judge-independent parameters. When evaluating dialogue systems in a human-AI information-seeking task, we find that LLM-Rubric with 9 questions (assessing dimensions such as naturalness, conciseness, and citation quality) predicts human judges' assessment of overall user satisfaction, on a scale of 1--4, with RMS error $< 0.5$, a $2\times$ improvement over the uncalibrated baseline.
arXiv.org Artificial Intelligence
Dec-30-2024
- Country:
  - Europe (0.92)
  - North America > United States (1.00)
- Genre:
  - Questionnaire & Opinion Survey (1.00)
  - Research Report > Experimental Study (0.46)
- Industry:
  - Education
    - Curriculum > Subject-Specific Education (0.45)
    - Educational Setting (0.67)
  - Health & Medicine (1.00)
- Technology: