American Stories: A Large-Scale Structured Text Dataset of Historical U.S. Newspapers

Jan-20-2025, 03:36:44 GMT–Neural Information Processing Systems

Existing full-text datasets of U.S. public domain newspapers do not recognize the often complex layouts of newspaper scans, and as a result the digitized content scrambles texts from articles, headlines, captions, advertisements, and other layout regions. OCR quality can also be low. This study develops a novel deep learning pipeline for extracting full article texts from newspaper images and applies it to the nearly 20 million scans in the Library of Congress's public domain Chronicling America collection. The pipeline includes layout detection, legibility classification, custom OCR, and association of article texts spanning multiple bounding boxes. To achieve high scalability, it is built with efficient architectures designed for mobile phones.
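The four stages named in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: `Region`, `Article`, `run_pipeline`, and `associate` are hypothetical names, the three model stages are passed in as plain callables, and the association step uses a toy reading-order heuristic (left to right by column, then top to bottom) that is assumed purely for illustration.

```python
from dataclasses import dataclass
from typing import Any, Callable, List, Optional, Tuple

BBox = Tuple[int, int, int, int]  # (x0, y0, x1, y1) in pixels


@dataclass
class Region:
    """One detected layout region (hypothetical structure)."""
    bbox: BBox
    label: str          # e.g. "headline", "article", "advertisement"
    legible: bool = True
    text: str = ""


@dataclass
class Article:
    """A headline plus the body text gathered from one or more boxes."""
    headline: str
    body: str


def associate(regions: List[Region]) -> List[Article]:
    """Toy association step (an assumption, not the paper's method).

    Regions are visited column by column (sorted by x0, then y0). Each
    headline opens a new article; each following legible "article" box is
    appended to the most recent one. Other labels are ignored.
    """
    articles: List[Article] = []
    current: Optional[Article] = None
    for region in sorted(regions, key=lambda r: (r.bbox[0], r.bbox[1])):
        if not region.legible:
            continue
        if region.label == "headline":
            current = Article(headline=region.text, body="")
            articles.append(current)
        elif region.label == "article":
            if current is None:
                # Body text with no preceding headline starts its own article.
                current = Article(headline="", body="")
                articles.append(current)
            current.body = (current.body + " " + region.text).strip()
    return articles


def run_pipeline(
    scan: Any,
    detect_layout: Callable[[Any], List[Region]],
    is_legible: Callable[[Any, Region], bool],
    ocr: Callable[[Any, Region], str],
) -> List[Article]:
    """Chain the stages: layout detection, legibility, OCR, association."""
    regions = detect_layout(scan)
    for region in regions:
        region.legible = is_legible(scan, region)
        if region.legible:
            # Only legible regions are sent to OCR.
            region.text = ocr(scan, region)
    return associate(regions)
```

Passing the stages as callables keeps the sketch independent of any particular detection or OCR model; real stand-ins for `detect_layout`, `is_legible`, and `ocr` would wrap trained models.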

american story, large-scale structured text dataset, newspaper

Industry:
- Media > News (1.00)

Technology:
- Information Technology > Artificial Intelligence
  - Natural Language (0.64)
  - Machine Learning > Neural Networks
    - Deep Learning (0.41)