QUEST: Quality-aware Semi-supervised Table Extraction for Business Documents

Eliott Thomas, Mickael Coustaty, Aurelie Joseph, Gaspar Deloin, Elodie Carel, Vincent Poulain D'Andecy, Jean-Marc Ogier

Jun-24-2025–arXiv.org Artificial Intelligence

Automating table extraction (TE) from business documents is critical for industrial workflows but remains challenging due to sparse annotations and error-prone multi-stage pipelines. While semi-supervised learning (SSL) can leverage unlabeled data, existing methods rely on confidence scores that poorly reflect extraction quality. We propose QUEST, a Quality-aware Semi-supervised Table extraction framework designed for business documents. QUEST introduces a novel quality assessment model that evaluates structural and contextual features of extracted tables and is trained to predict F1 scores instead of relying on confidence metrics. This quality-aware approach guides pseudo-label selection during iterative SSL training, while diversity measures (DPP, Vendi score, IntDiv) mitigate confirmation bias. Experiments on a proprietary business dataset (1,000 annotated and 10,000 unannotated documents) show that QUEST improves F1 from 64% to 74% and cuts the rate of empty predictions by a relative 45% (from 12% to 6.5%). On the DocILE benchmark (600 annotated and 20,000 unannotated documents), QUEST raises F1 from 42% to 50% and cuts the rate of empty predictions by a relative 19% (from 27% to 22%). The framework's interpretable quality assessments and robustness to annotation scarcity make it particularly suited for business documents, where structural consistency and data completeness are paramount.
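The selection step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the quality model itself is not reproduced, and the `predicted_f1` values, the `features` vectors, the 0.9 threshold, the use of cosine similarity, and all function names are hypothetical choices made for illustration. The sketch keeps only pseudo-labels whose predicted F1 meets a threshold, then measures how varied the retained set is with an internal-diversity (IntDiv-style) score, taken here as one minus the mean pairwise similarity over distinct pairs.

```python
from math import sqrt


def cosine(u, v):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def int_div(vectors):
    """IntDiv-style score: 1 minus the mean similarity over distinct pairs.

    Assumption: cosine similarity over hypothetical document features.
    """
    n = len(vectors)
    if n < 2:
        return 0.0
    sims = [
        cosine(vectors[i], vectors[j])
        for i in range(n)
        for j in range(i + 1, n)
    ]
    return 1.0 - sum(sims) / len(sims)


def select_pseudo_labels(candidates, min_quality=0.9):
    """Keep candidates whose predicted F1 meets the (assumed) threshold."""
    return [c for c in candidates if c["predicted_f1"] >= min_quality]


# Hypothetical candidates: predicted_f1 would come from a quality model.
candidates = [
    {"doc_id": "a", "predicted_f1": 0.95, "features": [1.0, 0.0]},
    {"doc_id": "b", "predicted_f1": 0.40, "features": [0.9, 0.1]},
    {"doc_id": "c", "predicted_f1": 0.92, "features": [0.0, 1.0]},
]

selected = select_pseudo_labels(candidates)
diversity = int_div([c["features"] for c in selected])
print([c["doc_id"] for c in selected], diversity)
```

In an iterative SSL loop, a set with low diversity could be a signal to widen the selection before retraining, since pseudo-labels drawn from near-identical documents tend to reinforce what the model already predicts well.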

information retrieval, machine learning, natural language, (20 more...)

Genre:
- Workflow (0.66)
- Research Report (0.64)

Technology:
- Information Technology > Artificial Intelligence
  - Natural Language > Information Retrieval (0.56)
  - Machine Learning
    - Inductive Learning (0.89)
    - Unsupervised or Indirectly Supervised Learning (0.56)
    - Neural Networks > Deep Learning (0.46)
