Radically Lower Data-Labeling Costs for Visually Rich Document Extraction Models

Zhou, Yichao, Wendt, James B., Potti, Navneet, Xie, Jing, Tata, Sandeep

Oct-28-2022–arXiv.org Artificial Intelligence

A key bottleneck in building automatic extraction models for visually rich documents like invoices is the cost of acquiring the several thousand high-quality labeled documents needed to train a model with acceptable accuracy. We propose Selective Labeling, which simplifies the labeling task to providing "yes/no" labels for candidate extractions predicted by a model trained on partially labeled documents. We combine this with a custom active learning strategy to find the predictions that the model is most uncertain about. We show through experiments on document types drawn from 3 different domains that Selective Labeling can reduce the cost of acquiring labeled data by 10× with a negligible loss in accuracy.
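As a rough illustration of the idea of sending a labeler only the predictions a model is most uncertain about, the sketch below ranks candidate extractions by how close their score is to 0.5 and keeps the top few for "yes/no" review. This is generic uncertainty sampling, not the paper's custom active learning strategy, which the abstract does not specify. The `Candidate` type, the `uncertainty` and `select_for_review` functions, the `budget` parameter, and the sample invoice values are all hypothetical names and data introduced here for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    """A hypothetical candidate extraction produced by a partially trained model."""
    doc_id: str
    field: str
    text: str
    score: float  # assumed: model's estimated probability that the candidate is correct


def uncertainty(score: float) -> float:
    """Return 1.0 when score is 0.5 and 0.0 when score is 0.0 or 1.0."""
    return 1.0 - abs(2.0 * score - 1.0)


def select_for_review(candidates, budget):
    """Pick the `budget` candidates whose scores are closest to 0.5."""
    ranked = sorted(candidates, key=lambda c: uncertainty(c.score), reverse=True)
    return ranked[:budget]


# Made-up candidates for two invoices (illustrative values only).
candidates = [
    Candidate("inv-001", "total_amount", "$1,250.00", 0.97),
    Candidate("inv-001", "due_date", "2022-11-15", 0.52),
    Candidate("inv-002", "total_amount", "$89.10", 0.08),
    Candidate("inv-002", "due_date", "2022-10-01", 0.41),
]

to_review = select_for_review(candidates, budget=2)
for c in to_review:
    # A human would answer "yes" or "no" to each of these questions.
    print(f"{c.doc_id}: is {c.field} = {c.text!r}? (score {c.score:.2f})")
```

With these made-up scores, the two `due_date` candidates are selected, since their scores sit nearest 0.5. The confident predictions (0.97 and 0.08) are left unreviewed, which is where the labeling savings would come from under this simplified scheme.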

data mining, machine learning, natural language

Country:
- North America
  - United States
    - Wisconsin > Dane County
      - Madison (0.04)
    - Virginia > Arlington County
      - Arlington (0.04)
  - Canada > British Columbia
    - Metro Vancouver Regional District > Vancouver (0.04)
- Asia > Middle East
  - Jordan (0.04)

Genre:
- Research Report (1.00)
- Overview (0.68)

Technology:
- Information Technology
  - Data Science > Data Mining (0.94)
  - Artificial Intelligence
    - Representation & Reasoning (1.00)
    - Natural Language (1.00)
    - Machine Learning > Neural Networks
      - Deep Learning (0.46)
