Accurate and Nuanced Open-QA Evaluation Through Textual Entailment

May-26-2024–arXiv.org Artificial Intelligence

Open-domain question answering (Open-QA) is a common task for evaluating large language models (LLMs). However, current Open-QA evaluations are criticized for ambiguous questions and for evaluators that lack semantic understanding. Even complex evaluators that are powered by foundation models or LLMs and that judge semantic equivalence still deviate from human judgments by a large margin. We propose studying the entailment relations between answers to identify system answers that are more informative or more general, which yields an evaluation much closer to human judgment on both NaturalQuestions and TriviaQA while remaining learning-free. The proposed entailment-based evaluation can assign bonus or partial marks by quantifying the inference gap between answers, enabling a nuanced ranking of answer correctness that achieves a higher AUC than current methods.
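The idea of grading by entailment direction can be illustrated with a small Python sketch. This is not the paper's implementation: the names grade, to_statement, toy_entails and DEFAULT_MARKS are hypothetical, the mark values are placeholders, and the token-subset check only stands in for a real entailment model. The mapping from entailment direction to "more informative" or "more general" is likewise an assumption made for illustration.

```python
from typing import Callable, Dict, Tuple

# Hypothetical interface: entails(premise, hypothesis) -> bool.
Entails = Callable[[str, str], bool]

# Placeholder marks, NOT taken from the paper.
DEFAULT_MARKS: Dict[str, float] = {
    "equivalent": 1.0,
    "more_informative": 1.0,
    "more_general": 0.5,
    "no_entailment": 0.0,
}


def to_statement(question: str, answer: str) -> str:
    """Join question and answer into one text (an assumed, naive template)."""
    return f"{question.strip()} {answer.strip()}"


def toy_entails(premise: str, hypothesis: str) -> bool:
    """Toy stand-in for an entailment model: the premise 'entails' the
    hypothesis when every hypothesis token also occurs in the premise."""
    return set(hypothesis.lower().split()) <= set(premise.lower().split())


def grade(
    question: str,
    gold: str,
    system: str,
    entails: Entails = toy_entails,
    marks: Dict[str, float] = DEFAULT_MARKS,
) -> Tuple[str, float]:
    """Label a system answer by the direction of entailment with the gold answer."""
    gold_stmt = to_statement(question, gold)
    sys_stmt = to_statement(question, system)
    sys_entails_gold = entails(sys_stmt, gold_stmt)
    gold_entails_sys = entails(gold_stmt, sys_stmt)

    if sys_entails_gold and gold_entails_sys:
        label = "equivalent"
    elif sys_entails_gold:
        # The system answer says at least as much as the gold answer.
        label = "more_informative"
    elif gold_entails_sys:
        # The system answer follows from the gold answer but says less.
        label = "more_general"
    else:
        label = "no_entailment"
    return label, marks[label]


if __name__ == "__main__":
    q = "who wrote hamlet"
    print(grade(q, "Shakespeare", "William Shakespeare"))
    print(grade(q, "William Shakespeare", "Shakespeare"))
    print(grade(q, "Shakespeare", "Marlowe"))
```

In practice the entails argument would be replaced by a call to an actual entailment model, and the marks would be chosen to reflect how large the inference gap between the two answers is.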

computational linguistic, gold answer, system answer, (14 more...)

Country:
- North America
  - Dominican Republic (0.04)
  - United States
    - Nevada > Clark County
      - Las Vegas (0.04)
    - Louisiana > Orleans Parish
      - New Orleans (0.04)
  - Canada
    - Alberta (0.14)
    - Nova Scotia (0.05)
    - Ontario > Toronto (0.04)
- Europe
  - Italy > Tuscany
    - Florence (0.04)
  - Greece > Attica
    - Athens (0.04)
- Asia
  - Singapore (0.04)
  - China > Hong Kong (0.04)
  - Middle East
    - Jordan (0.04)
    - UAE > Abu Dhabi Emirate
      - Abu Dhabi (0.04)
- Africa
  - Rwanda > Kigali
    - Kigali (0.04)
  - Ethiopia > Addis Ababa
    - Addis Ababa (0.04)

Genre:
- Research Report (0.50)

Technology:
- Information Technology > Artificial Intelligence
  - Natural Language > Large Language Model (1.00)
  - Machine Learning > Neural Networks
    - Deep Learning (0.73)
