DocGraphLM: Documental Graph Language Model for Information Extraction

Dongsheng Wang, Zhiqiang Ma, Armineh Nourbakhsh, Kang Gu, Sameena Shah

arXiv.org Artificial Intelligence 

Advances in Visually Rich Document Understanding (VrDU) have enabled information extraction and question answering over documents with complex layouts. Two tropes of architectures have emerged: transformer-based models inspired by LLMs, and Graph Neural Networks. In this paper, we introduce DocGraphLM, a novel framework that combines pre-trained language models with graph semantics. To achieve this, we propose 1) a joint encoder architecture to represent documents, and 2) a novel link prediction approach to reconstruct document graphs. DocGraphLM predicts both directions and distances between nodes using a convergent joint loss function that prioritizes neighborhood restoration and downweighs distant node detection.

Information extraction from visually-rich documents (VrDs), such as business forms, receipts, and invoices in PDF or image format, has gained recent traction. Tasks such as field identification and extraction, and entity linkage are crucial to digitizing VrDs and building information retrieval systems on the data. Tasks that require complex reasoning, such as Visual Question Answering over documents, require modeling the spatial, visual, and semantic signals in VrDs. Therefore, VrD Understanding is concerned with modeling the multi-modal content in image documents. Previous research has explored the use of encoding text, layout, and image features in a layout language model or multi-modal setting to improve
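Below is a minimal sketch of how the joint link-prediction loss described in the abstract might be realized in PyTorch, assuming cross-entropy over discretized direction bins, smooth-L1 regression on log-scaled pixel distances, and an exponential weight that downweighs distant pairs. The class name JointLinkLoss, the hyperparameters alpha, beta, and tau, and the exact weighting scheme are illustrative assumptions, not the paper's published formulation.

import torch
import torch.nn as nn
import torch.nn.functional as F

class JointLinkLoss(nn.Module):
    # Hypothetical joint loss for document-graph link prediction:
    # classify the direction of each node pair and regress their distance,
    # downweighting pairs that lie far apart so that restoring a node's
    # immediate neighborhood dominates the objective.
    def __init__(self, alpha=1.0, beta=1.0, tau=50.0):
        super().__init__()
        self.alpha = alpha  # weight of the direction-classification term
        self.beta = beta    # weight of the distance-regression term
        self.tau = tau      # decay scale (in pixels) for the pair weights

    def forward(self, dir_logits, dist_pred, dir_target, dist_target):
        # dir_logits:  (P, D) logits over D discretized directions
        # dist_pred:   (P,) non-negative predicted pixel distances
        # dir_target:  (P,) ground-truth direction bins
        # dist_target: (P,) ground-truth pixel distances
        weight = torch.exp(-dist_target / self.tau)  # smaller for distant pairs
        dir_loss = F.cross_entropy(dir_logits, dir_target, reduction="none")
        dist_loss = F.smooth_l1_loss(  # log scale tames the dynamic range
            torch.log1p(dist_pred), torch.log1p(dist_target), reduction="none")
        per_pair = weight * (self.alpha * dir_loss + self.beta * dist_loss)
        return per_pair.mean()

# Toy usage on four candidate node pairs with eight direction bins.
loss_fn = JointLinkLoss()
dir_logits = torch.randn(4, 8)
dist_pred = torch.rand(4) * 100
dir_target = torch.randint(0, 8, (4,))
dist_target = torch.rand(4) * 100
print(loss_fn(dir_logits, dist_pred, dir_target, dist_target))

The exponential weight is one way to realize the "downweighs distant node detection" behavior; any monotonically decreasing function of the true distance would serve the same purpose.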