SCORE: A Semantic Evaluation Framework for Generative Document Parsing
Renyu Li, Antonio Jimeno Yepes, Yao You, Kamil Pluciński, Maximilian Operlejn, Crag Wolfe
arXiv.org Artificial Intelligence
Traditional document parsing architectures employ deterministic pipelines that sequentially combine optical character recognition (OCR), layout analysis, and rule-based table extraction to produce structured outputs. The evaluation of these systems has relied on well-established task-specific metrics, including Character Error Rate (CER) and Word Error Rate (WER) [14, 20], Intersection-over-Union (IoU) [4, 16], and Tree Edit Distance-based Similarity (TEDS) [31]. These metrics operate under the assumption of a unique ground truth representation, rewarding exact matches while systematically penalizing any structural deviation. The emergence of multi-modal generative document parsing systems has fundamentally transformed this landscape. Vision Language Models (VLMs) such as GPT-5 Mini, Gemini 2.5 Flash, and Claude Sonnet 3.7/4 [22, 6, 1, 2] generate holistic document interpretations that integrate visual, textual, and structural signals in an end-to-end manner. Unlike their deterministic predecessors, these systems frequently produce outputs that are semantically correct yet structurally divergent. Consider a table containing merged cells: one system may represent it as a flattened token sequence preserving reading order, while another generates hierarchical HTML markup with explicit structural relationships. Both interpretations faithfully capture the semantic content, yet traditional evaluation frameworks treat them as fundamentally incompatible, systematically misclassifying valid alternative interpretations as parsing errors. This evaluation-paradigm mismatch has significant practical implications.
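The merged-cell scenario above can be made concrete with a small sketch (not from the paper; the table data and helper logic are hypothetical). It renders the same table two ways, shows that a strict string-level comparison, standing in for exact-match scoring, treats the outputs as incompatible, while a markup-agnostic comparison of cell content finds them equivalent:

```python
# Illustrative sketch: two semantically equivalent renderings of the same
# merged-cell table, and why exact-match metrics disagree with a
# content-level comparison.
import difflib
import re

# Rendering A: flattened token sequence preserving reading order
# (the merged "EU" cell appears once, as read).
flat = "EU Q1 100 Q2 120"

# Rendering B: hierarchical HTML markup with an explicit rowspan
# for the merged cell.
html = (
    '<table>'
    '<tr><td rowspan="2">EU</td><td>Q1</td><td>100</td></tr>'
    '<tr><td>Q2</td><td>120</td></tr>'
    '</table>'
)

# Strict string-level similarity (a stand-in for exact-match scoring)
# sees the two outputs as largely divergent.
strict = difflib.SequenceMatcher(None, flat, html).ratio()

# A content-level comparison strips markup and compares cell tokens.
tokens_a = flat.split()
tokens_b = " ".join(re.findall(r">([^<>]+)<", html)).split()

print(f"string similarity: {strict:.2f}")          # low score
print(f"same cell content: {tokens_a == tokens_b}")  # True
```

A semantics-aware evaluation framework would credit both renderings; a string- or tree-exactness metric penalizes one of them arbitrarily, which is precisely the mismatch the abstract describes.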
Sep-25-2025