Benchmarking Vision-Language Models on Chinese Ancient Documents: From OCR to Knowledge Reasoning

Yu, Haiyang, Wu, Yuchuan, Shi, Fan, Liao, Lei, Lu, Jinghui, Ge, Xiaodong, Wang, Han, Zhuo, Minghan, Wu, Xuecheng, Fei, Xiang, Feng, Hao, Tang, Guozhi, Wang, An-Lan, Zhu, Hanshen, He, Yangfan, Liang, Quanhuan, Meng, Liyuan, Feng, Chao, Huang, Can, Tang, Jingqun, Li, Bin

Sep-15-2025–arXiv.org Artificial Intelligence

Chinese ancient documents, invaluable carriers of millennia of Chinese history and culture, hold rich knowledge across diverse fields but face challenges in digitization and understanding: traditional methods only scan images, while current Vision-Language Models (VLMs) struggle with their visual and linguistic complexity. Existing document benchmarks focus on English printed texts or simplified Chinese, leaving a gap for evaluating VLMs on ancient Chinese documents. To address this, we present AncientDoc, the first benchmark for Chinese ancient documents, designed to assess VLMs from OCR to knowledge reasoning. AncientDoc includes five tasks (page-level OCR, vernacular translation, reasoning-based QA, knowledge-based QA, and linguistic variant QA) and covers 14 document types, over 100 books, and about 3,000 pages. Based on AncientDoc, we evaluate mainstream VLMs using multiple metrics, supplemented by a human-aligned large language model for scoring. The benchmark is available at https://bytedance.github.io/AncientDoc.

large language model, machine learning, qwen2

Genre:
- Research Report (0.82)

Technology:
- Information Technology > Artificial Intelligence
  - Vision (1.00)
  - Natural Language > Large Language Model (1.00)
  - Machine Learning > Neural Networks
    - Deep Learning (0.52)
