SummExecEdit: A Factual Consistency Benchmark in Summarization with Executable Edits
Onkar Thorat, Philippe Laban, Chien-Sheng Wu
arXiv.org Artificial Intelligence
Detecting factual inconsistencies in summarization is critical, yet existing benchmarks are neither challenging nor interpretable enough for robust evaluation. In this paper, we introduce SummExecEdit, a novel benchmark that leverages executable edits to assess models on their ability both to detect factual errors and to provide accurate explanations. The top-performing model, Claude3-Opus, achieves a joint detection-and-explanation score of only 0.49 on our benchmark, with individual scores of 0.67 for detection and 0.73 for explanation. Furthermore, we identify four primary types of explanation errors; the most common, accounting for 45.4% of errors, focuses on a completely unrelated part of the summary.
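To make the core idea concrete, here is a minimal sketch of what an "executable edit" and a joint detection-and-explanation metric might look like. All names and structure below are hypothetical illustrations, not the benchmark's actual interface; the paper itself should be consulted for the real design. Incidentally, 0.67 x 0.73 is approximately 0.49, which is consistent with (though not proof of) a joint score that credits a model only when both detection and explanation succeed on the same example.

```python
# Hypothetical sketch: an executable edit is a small, runnable
# transformation that injects a factual inconsistency into a seed
# summary, keeping the edited span as ground truth for evaluation.
# Class and function names here are illustrative assumptions.

from dataclasses import dataclass


@dataclass
class ExecEdit:
    """One executable edit: swap a factual span in a summary."""
    original_span: str  # factually correct text in the seed summary
    edited_span: str    # replacement that introduces the inconsistency

    def apply(self, summary: str) -> str:
        if self.original_span not in summary:
            raise ValueError("Edit does not apply to this summary.")
        return summary.replace(self.original_span, self.edited_span, 1)


def joint_score(results: list[tuple[bool, bool]]) -> float:
    """Hypothetical joint metric: fraction of examples where the model
    both detected the error and explained it correctly."""
    return sum(detected and explained for detected, explained in results) / len(results)


# Usage: produce an inconsistent summary and score a (mocked) model.
seed = "The company reported a 12% rise in quarterly revenue."
edit = ExecEdit(original_span="12% rise", edited_span="12% drop")
print(edit.apply(seed))  # The company reported a 12% drop in quarterly revenue.
print(joint_score([(True, True), (True, False), (False, False)]))  # 0.333...
```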
Dec-17-2024