Wukong-Reader: Multi-modal Pre-training for Fine-grained Visual Document Understanding
Haoli Bai, Zhiguang Liu, Xiaojun Meng, Wentao Li, Shuang Liu, Nian Xie, Rongfu Zheng, Liangwei Wang, Lu Hou, Jiansheng Wei, Xin Jiang, Qun Liu
arXiv.org Artificial Intelligence
Unsupervised pre-training on millions of digital-born or scanned documents has shown promising advances in visual document understanding (VDU). While various vision-language pre-training objectives have been studied in existing solutions, the document textline, an intrinsic granularity in VDU, has seldom been explored. A document textline usually contains words that are spatially and semantically correlated, and it can be easily obtained from OCR engines. In this paper, we propose Wukong-Reader, trained with new pre-training objectives that leverage the structural knowledge nested in document textlines. We introduce textline-region contrastive learning to achieve fine-grained alignment between the visual regions and texts of document textlines. Furthermore, masked region modeling and textline-grid matching are also designed to enhance the visual and layout representations of textlines. Experiments show that Wukong-Reader achieves superior performance on various VDU tasks such as information extraction. The fine-grained alignment over textlines also endows Wukong-Reader with promising localization ability.
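The textline-region contrastive objective described above can be sketched as a symmetric InfoNCE-style loss over paired textline embeddings. This is a minimal illustration, not the paper's implementation: the function name, the temperature value, and the use of plain NumPy arrays in place of the model's region and text encoders are all assumptions for the sake of a self-contained example.

```python
import numpy as np

def textline_region_contrastive_loss(region_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE-style loss aligning each textline's visual-region
    embedding with its text embedding. Matched pairs share a row index;
    all other pairs in the batch serve as negatives. (Illustrative sketch;
    the temperature and naming are assumptions, not the paper's values.)"""
    # L2-normalize so the dot products below are cosine similarities.
    r = region_emb / np.linalg.norm(region_emb, axis=1, keepdims=True)
    t = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    logits = (r @ t.T) / temperature  # (N, N): matched pairs on the diagonal
    idx = np.arange(len(r))

    def cross_entropy(l):
        # Stable log-softmax over each row, then pick the diagonal targets.
        l = l - l.max(axis=1, keepdims=True)
        log_probs = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -log_probs[idx, idx].mean()

    # Average the region-to-text and text-to-region directions.
    return 0.5 * (cross_entropy(logits) + cross_entropy(logits.T))
```

A perfectly aligned batch (each region embedding equal to its text embedding) drives the loss toward zero, while mismatched pairings keep it large, which is the behavior the fine-grained alignment objective relies on.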
Dec-19-2022