DAFE: LLM-Based Evaluation Through Dynamic Arbitration for Free-Form Question-Answering
arXiv.org Artificial Intelligence
Evaluating free-form responses generated by Large Language Models (LLMs) remains a challenge due to their diverse and open-ended nature. Traditional automatic metrics based on supervised signals fail to capture semantic equivalence or handle the variability of open-ended responses, while human evaluation, though reliable, is resource-intensive. Leveraging LLMs as evaluators offers a promising alternative due to their strong language understanding and instruction-following capabilities. Taking advantage of these capabilities, we propose the Dynamic Arbitration Framework for Evaluation (DAFE), which employs two primary LLM judges and engages a third arbitrator only when the two disagree. This selective arbitration prioritizes evaluation reliability while reducing unnecessary computational demands compared to conventional majority voting. DAFE combines task-specific reference answers with dynamic arbitration to enhance judgment accuracy, resulting in significant improvements in evaluation metrics such as Macro F1 and Cohen's Kappa. Through experiments, including a comprehensive human evaluation, we demonstrate DAFE's ability to provide consistent, scalable, and resource-efficient assessments, establishing it as a robust framework for evaluating free-form model outputs.
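The selective arbitration described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function `arbitrate`, the `Judge` type, and the three rule-based stand-in judges (`exact_match`, `contains_reference`, `token_overlap`) are hypothetical names introduced here, and a real setup would presumably replace the stand-ins with calls to LLM judges. The sketch only shows the control flow: two primary verdicts, with a third judge consulted on disagreement.

```python
import re
from typing import Callable, List, Tuple

# Hypothetical judge signature: (question, reference, candidate) -> verdict.
# True means the candidate answer is judged correct.
Judge = Callable[[str, str, str], bool]


def arbitrate(question: str, reference: str, candidate: str,
              judge_a: Judge, judge_b: Judge,
              arbitrator: Judge) -> Tuple[bool, bool]:
    """Return (final_verdict, arbitrator_was_used)."""
    verdict_a = judge_a(question, reference, candidate)
    verdict_b = judge_b(question, reference, candidate)
    if verdict_a == verdict_b:
        # The primary judges agree, so the third judge is never called.
        return verdict_a, False
    # The primary judges disagree, so the arbitrator decides.
    return arbitrator(question, reference, candidate), True


# Rule-based stand-ins for LLM judges (assumptions, for illustration only).
def exact_match(question: str, reference: str, candidate: str) -> bool:
    return candidate.strip().lower() == reference.strip().lower()


def contains_reference(question: str, reference: str, candidate: str) -> bool:
    return reference.lower() in candidate.lower()


def token_overlap(question: str, reference: str, candidate: str) -> bool:
    ref_tokens = set(re.findall(r"\w+", reference.lower()))
    cand_tokens = set(re.findall(r"\w+", candidate.lower()))
    return ref_tokens <= cand_tokens


examples: List[Tuple[str, str, str]] = [
    ("What is the capital of France?", "Paris", "Paris"),
    ("What is the capital of France?", "Paris", "It is Paris."),
    ("What is the capital of France?", "Paris", "London"),
]

results = [
    arbitrate(q, ref, cand, exact_match, contains_reference, token_overlap)
    for q, ref, cand in examples
]
arbitrator_calls = sum(used for _, used in results)
# Judge calls made here versus always querying all three judges.
selective_calls = 2 * len(examples) + arbitrator_calls
majority_vote_calls = 3 * len(examples)
```

With two primary judges, the arbitrator's verdict always sides with one of them, so the final label matches what a three-way majority vote over the same judges would return. The saving comes from skipping the third call whenever the first two already agree.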
Mar-11-2025
- Country:
  - North America
    - Dominican Republic (0.04)
    - United States
      - Pennsylvania (0.04)
      - Florida > Miami-Dade County
        - Miami (0.04)
    - Canada > Ontario
      - Toronto (0.04)
  - Europe
    - Germany (0.28)
    - Monaco (0.04)
    - United Kingdom > England (0.04)
    - Spain > Catalonia
      - Barcelona Province > Barcelona (0.04)
  - Asia
    - Singapore (0.04)
    - Indonesia > Bali (0.04)
    - Myanmar > Tanintharyi Region
      - Dawei (0.04)
    - Middle East
      - Jordan (0.04)
      - Yemen > Amran Governorate
        - Amran (0.04)
      - UAE > Abu Dhabi Emirate
        - Abu Dhabi (0.04)
  - Africa > Ethiopia
    - Addis Ababa > Addis Ababa (0.04)
- Genre:
  - Research Report > New Finding (1.00)
- Industry:
  - Law > Alternative Dispute Resolution (1.00)
  - Health & Medicine (1.00)
  - Government (0.92)
- Technology: