OpeNLGauge: An Explainable Metric for NLG Evaluation with Open-Weights LLMs

Kartáč, Ivan; Lango, Mateusz; Dušek, Ondřej

Mar-14-2025–arXiv.org Artificial Intelligence

Large Language Models (LLMs) have demonstrated great potential as evaluators of NLG systems, allowing for high-quality, reference-free, and multi-aspect assessments. However, existing LLM-based metrics suffer from two major drawbacks: reliance on proprietary models to generate training data or perform evaluations, and a lack of fine-grained, explanatory feedback. In this paper, we introduce OpeNLGauge, a fully open-source, reference-free NLG evaluation metric that provides accurate explanations based on error spans. OpeNLGauge is available as a two-stage ensemble of larger open-weight LLMs, or as a small fine-tuned evaluation model, with confirmed generalizability to unseen tasks, domains and aspects. Our extensive meta-evaluation shows that OpeNLGauge achieves competitive correlation with human judgments, outperforming state-of-the-art models on certain tasks while maintaining full reproducibility and providing explanations more than twice as accurate.

Country:
- North America
  - Dominican Republic (0.04)
  - United States
    - Pennsylvania (0.04)
    - Minnesota > Hennepin County
      - Minneapolis (0.14)
    - Massachusetts > Middlesex County
      - Cambridge (0.04)
    - Louisiana > Orleans Parish
      - New Orleans (0.04)
    - Florida > Miami-Dade County
      - Miami (0.04)
    - California > San Francisco County
      - San Francisco (0.04)
  - Mexico > Mexico City
    - Mexico City (0.04)
  - Canada
    - Quebec > Montreal (0.04)
    - Ontario > Toronto (0.04)
- Europe
  - Netherlands (0.04)
  - Monaco (0.04)
  - Czechia > Prague (0.04)
  - Spain
    - Community of Madrid > Madrid (0.04)
    - Catalonia > Barcelona Province
      - Barcelona (0.04)
  - Malta
    - Eastern Region > Northern Harbour District > St. Julian's (0.04)
  - Ireland > Leinster
    - County Dublin > Dublin (0.04)
  - Finland > Pirkanmaa
    - Tampere (0.04)
  - Denmark > Capital Region
    - Copenhagen (0.04)
- Asia
  - Singapore (0.04)
  - British Indian Ocean Territory > Diego Garcia (0.04)
  - Thailand > Bangkok
    - Bangkok (0.04)
  - Middle East
    - Jordan (0.04)
    - UAE > Abu Dhabi Emirate
      - Abu Dhabi (0.04)
    - Saudi Arabia > Asir Province
      - Abha (0.04)
  - China
    - Shanghai > Shanghai (0.04)
    - Hong Kong (0.04)

Genre:
- Research Report (1.00)

Industry:
- Leisure & Entertainment (0.45)
- Transportation
  - Infrastructure & Services (0.46)
  - Air (0.46)

Technology:
- Information Technology > Artificial Intelligence
  - Natural Language > Large Language Model (1.00)
  - Machine Learning > Neural Networks
    - Deep Learning (1.00)
