A Judge-free LLM Open-ended Generation Benchmark Based on the Distributional Hypothesis
Kentaro Imajo, Masanori Hirano, Shuji Suzuki, Hiroaki Mikami
arXiv.org Artificial Intelligence
Evaluating the open-ended text generation of large language models (LLMs) is challenging because of the lack of a clear ground truth and the high cost of human or LLM-based assessments. We propose a novel benchmark that evaluates LLMs using n-gram statistics and rules, without relying on human judgement or LLM-as-a-judge approaches. Using 50 sets of questions and reference answers, we introduce three new n-gram- and rule-based metrics: Fluency, Truthfulness, and Helpfulness. Our benchmark correlates strongly with GPT-4o-based evaluations while requiring significantly fewer computational resources, demonstrating its effectiveness as a scalable alternative for assessing LLMs' open-ended generation capabilities.
Feb 13, 2025
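The abstract does not spell out how the n-gram metrics are computed, so the following is only a minimal sketch of the kind of judge-free, reference-based n-gram scoring such a benchmark might build on: a BLEU-style clipped n-gram precision between a model answer and a reference answer. The function names and the overlap formula are illustrative assumptions, not the paper's actual Fluency, Truthfulness, or Helpfulness definitions.

```python
# Illustrative sketch only: a generic clipped n-gram overlap score between a
# candidate answer and a reference answer. The paper's actual metrics are
# not specified in the abstract; this is an assumed, simplified stand-in.
from collections import Counter

def ngrams(tokens, n):
    """Return a Counter of all contiguous n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def ngram_overlap(candidate, reference, n=2):
    """Fraction of candidate n-grams that also appear in the reference,
    with counts clipped to the reference (BLEU-style modified precision)."""
    cand = ngrams(candidate.lower().split(), n)
    ref = ngrams(reference.lower().split(), n)
    if not cand:
        return 0.0
    matched = sum(min(count, ref[gram]) for gram, count in cand.items())
    return matched / sum(cand.values())

if __name__ == "__main__":
    answer = "The Eiffel Tower is located in Paris, France."
    reference = "The Eiffel Tower stands in Paris, the capital of France."
    print(f"bigram overlap: {ngram_overlap(answer, reference, n=2):.2f}")
```

Because a score like this needs only tokenization and counting, it runs in milliseconds per answer, which is consistent with the abstract's claim of requiring far fewer computational resources than GPT-4o-based judging.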