Grade Score: Quantifying LLM Performance in Option Selection
–arXiv.org Artificial Intelligence
Large Language Models (LLMs) have demonstrated remarkable intelligence and versatility in tasks related to logic, reasoning, and grading [4, 1, 7]. This has led to the increasing use of LLMs as judges of arbitrary user-presented options and, at times, as judges of other LLMs [11, 12]. However, previous research has highlighted that LLMs exhibit biases, including a tendency to favor the first option presented to them. This paper explores various methods to mitigate order bias and improve the consistency of LLM judging. To facilitate progress in the study of LLM biases and consistency, we introduce a novel metric called the Grade Score, designed to quantify both the selection consistency and the order bias exhibited by an LLM, providing a comprehensive measure of its judging performance. A high score indicates a model that is highly consistent and fair with respect to option order, while a low score suggests significant order bias or inconsistency in the model's choices. The Grade Score serves as a tool for researchers and practitioners to assess and compare the performance of different LLMs in judging tasks. By quantifying the degree of instability and bias, the Grade Score enables the identification of models with superior judging capabilities and supports the development of techniques to mitigate biases and improve consistency.
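The abstract does not state how the Grade Score is computed, so the following is only a minimal sketch of the two quantities it is said to capture, not the paper's formula. It assumes a hypothetical experiment in which the same options are shown to a judge in several orders and, for each trial, the picked position and the picked option label are recorded. The function names, the use of normalized entropy for order bias, and the use of mode frequency for consistency are all illustrative assumptions.

```python
import math
from collections import Counter


def position_entropy(picked_positions, n_options):
    """Normalized Shannon entropy of the positions a judge picked.

    Assumption: 1.0 means picks are spread evenly over positions (no
    visible order bias); 0.0 means the judge always picks the same slot.
    Requires n_options >= 2.
    """
    total = len(picked_positions)
    counts = Counter(picked_positions)
    entropy = -sum((c / total) * math.log(c / total) for c in counts.values())
    return entropy / math.log(n_options)


def mode_frequency(picked_labels):
    """Share of trials in which the most frequently chosen option was picked.

    Assumption: 1.0 means the judge chose the same option in every
    ordering (fully consistent).
    """
    counts = Counter(picked_labels)
    return counts.most_common(1)[0][1] / len(picked_labels)


# Hypothetical trials: four options, each rotated into the first slot once.
# A judge that always picks slot 0 ends up choosing a different option each time.
biased_positions = [0, 0, 0, 0]
biased_labels = ["A", "B", "C", "D"]

# A judge that always picks option "A", wherever it appears.
fair_positions = [0, 1, 2, 3]
fair_labels = ["A", "A", "A", "A"]
```

Under these assumptions the position-biased judge scores low on both quantities and the order-independent judge scores high on both, which matches the high-versus-low reading of the score described above. How the paper combines such quantities into a single number is not specified in this abstract.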
Jun-20-2024
- Country:
- North America > United States
- Illinois > Champaign County > Urbana (0.04)
- Europe > Ukraine
- Kyiv Oblast > Kyiv (0.04)
- Genre:
- Research Report (1.00)
- Technology: