Making Intelligent Document Processing Smarter: Part 1 - KDnuggets
As seen from the table, these cleaning methods do not work on all the images and in fact sometimes the API performance degrades after applying these cleaning methods. Hence there is a need for a unified solution which can work on all kinds of noises. After testing various datasets including Noisy Office, Smart Doc QA, SROIE and custom datasets to compare and evaluate the performance of Tesseract, Vision and Textract, we can conclude that the OCR output gets affected by the noises present in the documents. The inbuilt denoiser or pre-processor is not sufficient to handle most of the noises including motion blur, watermark etc. If the document images are denoised, the OCR output can improve significantly.
Feb-10-2023, 18:00:12 GMT