Long-Form Information Alignment Evaluation Beyond Atomic Facts

Zheng, Danna, Lapata, Mirella, Pan, Jeff Z.

May-22-2025–arXiv.org Artificial Intelligence

Information alignment evaluators are vital for various NLG evaluation tasks and trustworthy LLM deployment, reducing hallucinations and enhancing user trust. Current fine-grained methods, like FactScore, verify facts individually but neglect inter-fact dependencies, enabling subtle vulnerabilities. In this work, we introduce MontageLie, a challenging benchmark that constructs deceptive narratives by "montaging" truthful statements without introducing explicit hallucinations. We demonstrate that both coarse-grained LLM-based evaluators and current fine-grained frameworks are susceptible to this attack, with AUC-ROC scores falling below 65%. To enable more robust fine-grained evaluation, we propose DoveScore, a novel framework that jointly verifies factual accuracy and event-order consistency. By modeling inter-fact relationships, DoveScore outperforms existing fine-grained methods by over 8%, providing a more robust solution for long-form text alignment evaluation. Our code and datasets are available at https://github.com/dannalily/DoveScore.
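To make the failure mode concrete, the following is a minimal toy sketch, not the authors' DoveScore implementation. It assumes events are already extracted as exact-match strings (a real evaluator would use an LLM or NLI model to decompose and verify them), and the function names and the choice to multiply the two scores are hypothetical, chosen only for illustration. It shows how a reordered "montage" of true statements can receive a perfect per-fact score while failing an event-order check.

```python
from itertools import combinations


def fact_precision(source_events, candidate_events):
    """Fraction of candidate events found in the source, ignoring order."""
    source = set(source_events)
    if not candidate_events:
        return 0.0
    return sum(e in source for e in candidate_events) / len(candidate_events)


def order_consistency(source_events, candidate_events):
    """Fraction of supported event pairs whose relative order matches the source.

    Assumes source events are distinct strings (a simplification).
    """
    position = {e: i for i, e in enumerate(source_events)}
    supported = [e for e in candidate_events if e in position]
    # combinations() yields pairs (a, b) with a appearing before b in the candidate
    pairs = list(combinations(supported, 2))
    if not pairs:
        return 1.0
    agree = sum(position[a] < position[b] for a, b in pairs)
    return agree / len(pairs)


def joint_score(source_events, candidate_events):
    """Hypothetical combination: product of fact precision and order consistency."""
    return fact_precision(source_events, candidate_events) * order_consistency(
        source_events, candidate_events
    )


source = [
    "suspect threatened victim",
    "victim called police",
    "police arrested suspect",
]
# Every statement is true, but the sequence is reversed to imply a different story.
montage = list(reversed(source))

print(fact_precision(source, montage))     # per-fact check alone
print(order_consistency(source, montage))  # order check
print(joint_score(source, montage))        # joint check
```

In this toy setting the per-fact check cannot distinguish the montage from the faithful narrative, while the order-aware score can; this is the gap that a joint fact-and-order evaluator is meant to close.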

computational linguistics, large language model, machine learning

Country:
- North America > United States (1.00)
- Asia > Middle East
  - UAE (0.46)

Genre:
- Research Report > New Finding (0.67)

Industry:
- Media (1.00)

Technology:
- Information Technology > Artificial Intelligence
  - Natural Language > Large Language Model (1.00)
  - Machine Learning > Neural Networks
    - Deep Learning (0.30)
