Grade Score: Quantifying LLM Performance in Option Selection

Iourovitski, Dmitri

arXiv.org Artificial Intelligence 

Large Language Models (LLMs) have demonstrated remarkable intelligence and versatility in tasks related to logic, reasoning, and grading [4, 1, 7]. This has led to the increasing use of LLMs as judges of arbitrary user-presented options and, at times, as judges of other LLMs [11, 12]. However, previous research has shown that LLMs exhibit biases and tend to favor the first option presented to them. This paper explores various methods to mitigate order bias and improve the consistency of LLM judging. To facilitate progress in the study of LLM biases and consistency, we introduce a novel metric called the Grade Score, designed to quantify both the selection consistency and the order bias exhibited by an LLM, providing a comprehensive measure of its judging performance. A high score indicates a model that is highly consistent and fair with respect to option order, while a low score suggests significant order bias or inconsistency in the model's choices. The Grade Score serves as a tool for researchers and practitioners to assess and compare the performance of different LLMs in judging tasks. By quantifying the degree of instability and bias, it enables the identification of models with superior judging capabilities and facilitates the development of techniques to mitigate biases and improve consistency.
