You Actually Look Twice At it (YALTAi): using an object detection approach instead of region segmentation within the Kraken engine
arXiv.org Artificial Intelligence
Layout Analysis (the identification of zones and their classification) and line segmentation are the first steps in Optical Character Recognition and similar tasks. The ability to distinguish the main body of text from marginal text or running titles makes the difference between extracting the full text of a digitized book and producing noisy output. We show that most segmenters focus on pixel classification, and that polygonization of this output has not been used as a target in the latest competitions on historical documents (ICDAR 2017 and onwards), despite being the focus in the early 2010s. We suggest that shifting the task from pixel-classification-based polygonization to object detection using isothetic rectangles might improve results in terms of both speed and accuracy. We compare the output of Kraken and YOLOv5 in terms of segmentation and show that the latter substantially outperforms the former on small datasets (1110 samples and below). We release two datasets for training and evaluation on historical documents, as well as a new package, YALTAi, which injects YOLOv5 into the segmentation pipeline of Kraken 4.1.

I INTRODUCTION

In recent years, automatic text extraction has become an important activity in digital philology and, more generally, in corpus creation for historical documents.
Jul-19-2022