Evaluating LLMs at Detecting Errors in LLM Responses
Ryo Kamoi, Sarkar Snigdha Sarathi Das, Renze Lou, Jihyun Janice Ahn, Yilun Zhao, Xiaoxin Lu, Nan Zhang, Yusen Zhang, Ranran Haoran Zhang, Sujeeth Reddy Vummanthala, Salika Dave, Shaobo Qin, Arman Cohan, Wenpeng Yin, Rui Zhang
With Large Language Models (LLMs) being widely used across various tasks, detecting errors in their responses is increasingly crucial. However, little research has been conducted on error detection in LLM responses. Collecting error annotations on LLM responses is challenging due to the subjective nature of many NLP tasks, so previous research has focused on tasks of little practical value (e.g., word sorting) or on limited error types (e.g., faithfulness in summarization). This work introduces ReaLMistake, the first error detection benchmark consisting of objective, realistic, and diverse errors made by LLMs. ReaLMistake contains three challenging and meaningful tasks that introduce objectively assessable errors in four categories (reasoning correctness, instruction-following, context-faithfulness, and parameterized knowledge), eliciting naturally observed and diverse errors in responses of GPT-4 and Llama 2 70B, annotated by experts. We use ReaLMistake to evaluate error detectors based on 12 LLMs. Our findings show that top LLMs like GPT-4 and Claude 3 detect errors made by LLMs at very low recall, and that all LLM-based error detectors perform much worse than humans.
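The abstract does not spell out the evaluation protocol, but as a rough illustration, the sketch below shows one way a binary LLM-based error detector could be scored for recall against expert error annotations. The prompt wording, the Example fields, and the call_llm callable are hypothetical placeholders for this sketch, not the benchmark's actual interface or data format.

    from dataclasses import dataclass
    from typing import Callable, List

    @dataclass
    class Example:
        task_input: str     # task given to the response-generating LLM
        llm_response: str   # response from, e.g., GPT-4 or Llama 2 70B
        has_error: bool     # expert annotation: does the response contain an error?

    def detect_error(call_llm: Callable[[str], str], ex: Example) -> bool:
        """Ask a detector LLM whether the response contains an error (hypothetical prompt)."""
        prompt = (
            "Task:\n" + ex.task_input + "\n\n"
            "Response:\n" + ex.llm_response + "\n\n"
            "Does the response contain any error? Answer 'yes' or 'no'."
        )
        answer = call_llm(prompt).strip().lower()
        return answer.startswith("yes")

    def recall(examples: List[Example], call_llm: Callable[[str], str]) -> float:
        """Fraction of responses with expert-annotated errors that the detector flags."""
        with_errors = [ex for ex in examples if ex.has_error]
        if not with_errors:
            return 0.0
        flagged = sum(detect_error(call_llm, ex) for ex in with_errors)
        return flagged / len(with_errors)

In practice, call_llm would wrap whichever detector model is under evaluation (GPT-4, Claude 3, etc.), and the low recall reported above corresponds to detectors rarely flagging the responses that experts marked as erroneous.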
arXiv.org Artificial Intelligence
Apr-4-2024