GRAM: Global Reasoning for Multi-Page VQA
Tsachi Blau, Sharon Fogel, Roi Ronen, Alona Golts, Roy Ganz, Elad Ben Avraham, Aviad Aberdam, Shahar Tsiper, Ron Litman
The increasing use of transformer-based large language models brings forward the challenge of processing long sequences. In document visual question answering (DocVQA), leading methods focus on the single-page setting, while documents can span hundreds of pages. We present GRAM, a method that seamlessly extends pre-trained single-page models to the multi-page setting without requiring computationally heavy pretraining. To do so, we leverage a single-page encoder for local page-level understanding and enhance it with designated document-level layers and learnable tokens that facilitate the flow of information across pages for global reasoning. To ensure our model utilizes the newly introduced document-level tokens, we propose a tailored bias adaptation method. For additional computational savings during decoding, we introduce an optional compression stage using our C-Former model, which reduces the encoded sequence length and thereby allows a tradeoff between quality and latency. Extensive experiments showcase GRAM's state-of-the-art performance on multi-page DocVQA benchmarks, demonstrating the effectiveness of our approach.
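To make the architecture described in the abstract concrete, the sketch below illustrates the general idea of interleaving per-page encoding with document-level layers that operate on learnable tokens shared across pages. It is a minimal PyTorch rendition under assumed dimensions and interleaving; the class name `GlobalReasoningEncoder`, the hyperparameters, and the way doc tokens are gathered across pages are illustrative assumptions, not the paper's implementation (the bias adaptation method and C-Former compression are omitted).

```python
# Minimal sketch: shared page-level layers process each page locally, while
# interleaved document-level layers let learnable "doc tokens" exchange
# information across pages. All sizes and names here are assumptions.
import torch
import torch.nn as nn


class GlobalReasoningEncoder(nn.Module):
    def __init__(self, d_model=256, n_heads=4, n_blocks=2, n_doc_tokens=8):
        super().__init__()
        # Learnable document-level tokens, prepended to every page.
        self.doc_tokens = nn.Parameter(torch.randn(n_doc_tokens, d_model))
        # Page-level layers: self-attention within a single page.
        self.page_layers = nn.ModuleList(
            nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
            for _ in range(n_blocks)
        )
        # Document-level layers: self-attention across all pages' doc tokens.
        self.doc_layers = nn.ModuleList(
            nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
            for _ in range(n_blocks)
        )
        self.n_doc_tokens = n_doc_tokens

    def forward(self, pages):
        # pages: (num_pages, page_len, d_model) -- one row of embeddings per page.
        num_pages, page_len, d = pages.shape
        doc = self.doc_tokens.unsqueeze(0).expand(num_pages, -1, -1)
        x = torch.cat([doc, pages], dim=1)  # (num_pages, n_doc + page_len, d)
        for page_layer, doc_layer in zip(self.page_layers, self.doc_layers):
            # Local step: each page (with its doc tokens) is encoded in isolation.
            x = page_layer(x)
            # Global step: pool every page's doc tokens into one sequence so
            # they can attend across page boundaries, then scatter them back.
            doc = x[:, : self.n_doc_tokens].reshape(1, -1, d)
            doc = doc_layer(doc).reshape(num_pages, self.n_doc_tokens, d)
            x = torch.cat([doc, x[:, self.n_doc_tokens :]], dim=1)
        return x


# Usage: four pages of 32 token embeddings each.
enc = GlobalReasoningEncoder()
out = enc(torch.randn(4, 32, 256))
print(out.shape)  # torch.Size([4, 40, 256])
```

The design point this sketch captures is that only the small set of doc tokens participates in cross-page attention, so the global step stays cheap even as the page count grows.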
arXiv.org Artificial Intelligence
Jan-7-2024
- Country:
- Africa (1.00)
- Asia > Middle East (0.93)
- Europe (1.00)
- North America > United States
- California > San Francisco County
- San Francisco (0.14)
- New Jersey (1.00)
- New York (0.68)
- Texas (1.00)
- Genre:
- Research Report (1.00)
- Industry:
- Aerospace & Defense (1.00)
- Education (0.67)
- Government
- Health & Medicine > Therapeutic Area (0.92)
- Law (1.00)
- Law Enforcement & Public Safety > Crime Prevention & Enforcement (0.92)
- Transportation (0.67)
- Technology: