VisRAG: Vision-based Retrieval-augmented Generation on Multi-modality Documents
Shi Yu, Chaoyue Tang, Bokai Xu, Junbo Cui, Junhao Ran, Yukun Yan, Zhenghao Liu, Shuo Wang, Xu Han, Zhiyuan Liu, Maosong Sun
arXiv.org Artificial Intelligence
Retrieval-augmented generation (RAG) is an effective technique that enables large language models (LLMs) to utilize external knowledge sources for generation. However, current RAG systems are solely based on text, rendering it impossible to utilize vision information like layout and images that play crucial roles in real-world multi-modality documents. In this paper, we introduce VisRAG, which tackles this issue by establishing a vision-language model (VLM)-based RAG pipeline. In this pipeline, instead of first parsing the document to obtain text, the document is directly embedded as an image using a VLM and then retrieved to enhance the generation of a VLM. Compared to traditional text-based RAG, VisRAG maximizes the retention and utilization of the information in the original documents, eliminating the information loss introduced during the parsing process. We collect both open-source and synthetic data to train the retriever in VisRAG and explore a variety of generation methods. Experiments demonstrate that VisRAG outperforms traditional RAG in both the retrieval and generation stages, achieving a 25-39% end-to-end performance gain over the traditional text-based RAG pipeline. Further analysis reveals that VisRAG is effective in utilizing training data and demonstrates strong generalization capability, positioning it as a promising solution for RAG on multi-modality documents. Our code and data are available at https://github.com/openbmb/visrag.

Trained on massive data, large language models (LLMs) like GPT-4 (Achiam et al., 2023) have shown strong abilities in common NLP tasks using their parametric knowledge (Wei et al., 2022; Zhao et al., 2023). However, parametric knowledge alone is static and can be insufficient for knowledge-intensive tasks. Retrieval-augmented generation (RAG) alleviates this problem by using a knowledge retriever, which has access to a custom outer knowledge base, to supply the LLM with the necessary information for generating outputs (Guu et al., 2020; Lewis et al., 2020; Yu et al., 2023).
Open-source RAG frameworks like LlamaIndex (Liu, 2022) have been developed to facilitate the research and deployment of common RAG pipelines. Typical RAG pipelines are text-based, operating on segmented texts as retrieval units (Yu et al., 2023; Asai et al., 2024; Yan et al., 2024), which we refer to as TextRAG.
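The pipeline described above can be sketched in miniature. The stubs below are hypothetical and not the authors' implementation: toy vectors stand in for VLM page-image and query embeddings, retrieval is plain cosine similarity, and the comment at the end marks where the retrieved page images (not parsed text) would be handed to a VLM generator.

```python
# Minimal VisRAG-style retrieval sketch (hypothetical stubs, not the
# paper's code). In the real pipeline, `pages` would hold VLM embeddings
# of document page *images*, and `query` the VLM embedding of the question.
from math import sqrt

def cosine(a, b):
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = sqrt(sum(x * x for x in a))
    nb = sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(query_emb, page_embs, k=2):
    """Rank candidate pages by similarity to the query; return top-k indices."""
    ranked = sorted(range(len(page_embs)),
                    key=lambda i: cosine(query_emb, page_embs[i]),
                    reverse=True)
    return ranked[:k]

# Toy 2-d embeddings standing in for VLM outputs.
pages = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
query = [1.0, 0.05]
top = retrieve(query, pages, k=2)  # indices of the most similar pages
# The retrieved page *images* would then be passed, together with the
# question, to a VLM generator, e.g. generate(question, images=top_pages).
```

The key difference from TextRAG is only in what the embeddings represent: whole page images rather than parsed text chunks, so layout and figures survive into both retrieval and generation.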
Oct-14-2024