A Step Towards Mixture of Grader: Statistical Analysis of Existing Automatic Evaluation Metrics

Soh, Yun Joon, Zhao, Jishen

arXiv.org Artificial Intelligence 

Large Language Models (LLMs) are widely adopted across various tasks, including Question-Answering (QA) tasks. More and more models, including fine-tuned models and the datasets used for fine-tuning, are released daily. This explosion in the number of models and datasets underscores the importance of accurate automatic evaluation, both for guiding language model training and for gauging their QA capabilities. However, varying question types (short-form, long-form, open-ended, etc.) and ambiguity in the grading rubric make it difficult to objectively gauge each model's QA capability. No single existing evaluation metric can capture a language model's QA answer quality across multiple question types. For example, Exact Match (EM) is a widely adopted all-or-nothing evaluation metric that shows a high correlation with human-evaluated scores for short-form QA tasks, but it is too strict to give credit to semantically equivalent answers. The lack of an objective grading rubric for varying QA types also creates a bias in summary statistics: for example, half credit for an open-ended question is treated the same as half credit for a simple factual question. In this paper, we (1) apply statistical approaches to characterize various existing evaluation metrics, (2) assess the effectiveness of the recent ChatGPT-o1-preview model [6] as a QA grader, and (3) propose a potential solution, a Mixture Of Grader (MOG), which first classifies each (question, gold answer) pair into one of several predefined QA type classes and then selects the appropriate evaluation metric accordingly, yielding an advanced automatic evaluation that better correlates with human evaluators.
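To make the MOG idea concrete, the sketch below illustrates the classify-then-route pattern the abstract describes: a (question, gold answer) pair is assigned a QA type, and a type-specific metric grades the prediction. The type labels, the length-based classifier, and the choice of Exact Match versus token-overlap F1 per type are illustrative assumptions, not the paper's actual implementation.

```python
# Minimal sketch of MOG-style metric routing (illustrative, not the paper's code).
from typing import Callable, Dict, Tuple


def exact_match(prediction: str, gold: str) -> float:
    """All-or-nothing Exact Match after simple normalization."""
    return float(prediction.strip().lower() == gold.strip().lower())


def token_f1(prediction: str, gold: str) -> float:
    """Token-overlap F1, a softer metric for longer answers."""
    pred_tokens, gold_tokens = prediction.lower().split(), gold.lower().split()
    common = set(pred_tokens) & set(gold_tokens)
    if not pred_tokens or not gold_tokens or not common:
        return 0.0
    precision = len(common) / len(pred_tokens)
    recall = len(common) / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)


def classify_qa_type(question: str, gold: str) -> str:
    """Hypothetical classifier: uses gold-answer length as a stand-in for
    the paper's (question, gold answer) QA-type classification step."""
    return "short_form" if len(gold.split()) <= 3 else "long_form"


# Per-type grader table: EM for short factual answers, token F1 otherwise (assumed mapping).
GRADERS: Dict[str, Callable[[str, str], float]] = {
    "short_form": exact_match,
    "long_form": token_f1,
}


def mog_score(question: str, gold: str, prediction: str) -> Tuple[str, float]:
    """Classify the (question, gold answer) pair, then apply the matching metric."""
    qa_type = classify_qa_type(question, gold)
    return qa_type, GRADERS[qa_type](prediction, gold)


if __name__ == "__main__":
    print(mog_score("Who wrote Hamlet?", "William Shakespeare", "william shakespeare"))
    print(mog_score("Why is the sky blue?",
                    "Sunlight scatters off air molecules, and blue light scatters most",
                    "Because blue light is scattered more strongly by the atmosphere"))
```

In this sketch, the routing table is the key design point: swapping in a learned classifier or an LLM grader (such as the ChatGPT-o1-preview grader studied in the paper) for a given QA type only requires changing one entry, which is what makes the mixture approach modular.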