LongDocURL: a Comprehensive Multimodal Long Document Benchmark Integrating Understanding, Reasoning, and Locating
Chao Deng, Jiale Yuan, Pi Bu, Peijie Wang, Zhong-Zhi Li, Jian Xu, Xiao-Hui Li, Yuan Gao, Jun Song, Bo Zheng, Cheng-Lin Liu
Large vision-language models (LVLMs) have remarkably improved document understanding capabilities, enabling the handling of complex document elements, longer contexts, and a wider range of tasks. However, existing document understanding benchmarks are limited to a small number of pages and fail to provide a comprehensive analysis of layout-element locating. In this paper, we first define three primary task categories: Long Document Understanding, numerical Reasoning, and cross-element Locating, and then propose a comprehensive benchmark, LongDocURL, integrating these three primary tasks and comprising 20 sub-tasks categorized by primary task and answer evidence. Furthermore, we develop a semi-automated construction pipeline and collect 2,325 high-quality question-answering pairs covering more than 33,000 pages of documents, significantly surpassing existing benchmarks in scale. Subsequently, we conduct comprehensive evaluation experiments on both open-source and closed-source models across 26 different configurations, revealing critical performance gaps in this field.
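To make the benchmark structure described above concrete, here is a minimal, hypothetical sketch of what a single LongDocURL question-answering sample and a simple exact-match evaluation might look like. The field names (doc_id, evidence_pages, evidence_type, etc.) are illustrative assumptions for exposition, not the paper's actual data schema or scoring protocol.

```python
# Hypothetical sketch of a LongDocURL-style QA sample; field names are
# assumptions, not the paper's actual schema.
from dataclasses import dataclass
from typing import List


@dataclass
class QASample:
    doc_id: str                 # identifier of the source document
    question: str               # natural-language question
    answer: str                 # gold answer string
    task: str                   # primary task: "understanding", "reasoning", or "locating"
    evidence_pages: List[int]   # pages containing the answer evidence
    evidence_type: str          # e.g. "text", "table", "figure", "layout"


def exact_match_accuracy(samples: List[QASample], predictions: List[str]) -> float:
    """Fraction of predictions matching the gold answer after light normalization."""
    correct = sum(
        pred.strip().lower() == sample.answer.strip().lower()
        for sample, pred in zip(samples, predictions)
    )
    return correct / len(samples) if samples else 0.0


if __name__ == "__main__":
    sample = QASample(
        doc_id="report_001",
        question="What is the total revenue reported in Table 3?",
        answer="$4.2 million",
        task="reasoning",
        evidence_pages=[17, 18],
        evidence_type="table",
    )
    print(exact_match_accuracy([sample], ["$4.2 million"]))  # -> 1.0
```

Grouping samples by task and evidence_type in this way would allow per-sub-task accuracy breakdowns of the kind the paper reports across its 20 sub-tasks.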
arXiv.org Artificial Intelligence
Dec-27-2024