paddleocr
LLaVA-Read: Enhancing Reading Ability of Multimodal Language Models
Zhang, Ruiyi, Zhou, Yufan, Chen, Jian, Gu, Jiuxiang, Chen, Changyou, Sun, Tong
Large multimodal language models have demonstrated impressive capabilities in understanding and manipulating images. However, many of these models struggle with comprehending intensive textual contents embedded within the images, primarily due to the limited text recognition and layout understanding ability. To understand the sources of these limitations, we perform an exploratory analysis showing the drawbacks of classical visual encoders on visual text understanding. Hence, we present LLaVA-Read, a multimodal large language model that utilizes dual visual encoders along with a visual text encoder. Our model surpasses existing state-of-the-art models in various text-rich image understanding tasks, showcasing enhanced comprehension of textual content within images. Together, our research suggests visual text understanding remains an open challenge and an efficient visual text encoder is crucial for future successful multimodal systems.
A Novel Implementation of Marksheet Parser Using PaddleOCR
Bagaria, Sankalp, Irene, S, Harikrishnan, M, Elakia V
When an applicant files an online application, they are usually required to enter their marks in the online form and also upload the marksheet to the portal for verification. A system was built that reads the uploaded marksheet using OCR and automatically fills the rows/columns in the online form. Though there are partial solutions to this problem - implemented using PyTesseract - their accuracy is low. Hence, PaddleOCR was used to build the marksheet parser. Several pre-processing and post-processing steps were also performed. The system was tested and evaluated for seven states. Further work is under way, and the system is being evaluated for more states and boards of India.
The Solution for the ICCV 2023 1st Scientific Figure Captioning Challenge
Chao, Dian, Song, Xin, Zhong, Shupeng, Wang, Boyuan, Wu, Xiangyu, Zhu, Chen, Yang, Yang
In this paper, we propose a solution for improving the quality of captions generated for figures in papers. We adopt the approach of summarizing the textual content in the paper to generate image captions. Throughout our study, we encounter discrepancies in the OCR information provided in the official dataset. To rectify this, we employ the PaddleOCR toolkit to extract OCR information from all images. Moreover, we observe that certain textual content in the official paper pertains to images that are not relevant for captioning, thereby introducing noise during caption generation. To mitigate this issue, we leverage LLaMA to extract image-specific information by querying the textual content based on image mentions, effectively filtering out extraneous information. Additionally, we recognize a discrepancy between the primary use of maximum likelihood estimation during text generation and the evaluation metrics such as ROUGE employed to assess the quality of generated captions. To bridge this gap, we integrate the BRIO model framework, enabling a more coherent alignment between the generation and evaluation processes. Our approach ranked first in the final test with a score of 4.49.
EfficientOCR: An Extensible, Open-Source Package for Efficiently Digitizing World Knowledge
Bryan, Tom, Carlson, Jacob, Arora, Abhishek, Dell, Melissa
Billions of public domain documents remain trapped in hard copy or lack an accurate digitization. Modern natural language processing methods cannot be used to index, retrieve, and summarize their texts; conduct computational textual analyses; or extract information for statistical analyses, and these texts cannot be incorporated into language model training. Given the diversity and sheer quantity of public domain texts, liberating them at scale requires optical character recognition (OCR) that is accurate, extremely cheap to deploy, and sample-efficient to customize to novel collections, languages, and character sets. Existing OCR engines, largely designed for small-scale commercial applications in high resource languages, often fall short of these requirements. EffOCR (EfficientOCR), a novel open-source OCR package, meets both the computational and sample efficiency requirements for liberating texts at scale by abandoning the sequence-to-sequence architecture typically used for OCR, which takes representations from a learned vision model as inputs to a learned language model. Instead, EffOCR models OCR as a character or word-level image retrieval problem. EffOCR is cheap and sample efficient to train, as the model only needs to learn characters' visual appearance and not how they are used in sequence to form language. Models in the EffOCR model zoo can be deployed off-the-shelf with only a few lines of code. Importantly, EffOCR also allows for easy, sample efficient customization with a simple model training interface and minimal labeling requirements due to its sample efficiency. We illustrate the utility of EffOCR by cheaply and accurately digitizing 20 million historical U.S. newspaper scans, evaluating zero-shot performance on randomly selected documents from the U.S. National Archives, and accurately digitizing Japanese documents for which all other OCR solutions failed.
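The retrieval framing described above can be illustrated with a minimal sketch. It is not EffOCR's actual code or API: the embeddings are tiny hand-made vectors standing in for the output of a learned vision encoder, and the names `REFERENCE_INDEX`, `cosine_similarity`, and `recognize` are hypothetical. The sketch only shows the core idea of labeling each character crop by its nearest reference embedding instead of decoding with a language model.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def recognize(crop_embeddings, index):
    """Label each character-crop embedding with its nearest reference character."""
    chars = []
    for emb in crop_embeddings:
        # Retrieval step: pick the reference character whose embedding is most similar.
        best = max(index, key=lambda label: cosine_similarity(emb, index[label]))
        chars.append(best)
    return "".join(chars)


# Hypothetical toy index: one reference embedding per character.
# A real system would obtain these from a trained vision encoder.
REFERENCE_INDEX = {
    "a": [1.0, 0.0, 0.0],
    "b": [0.0, 1.0, 0.0],
    "c": [0.0, 0.0, 1.0],
}

# Hypothetical embeddings of three localized character crops, in reading order.
crops = [[0.1, 0.9, 0.0], [0.8, 0.1, 0.1], [0.0, 0.2, 0.9]]
print(recognize(crops, REFERENCE_INDEX))
```

Under this framing, supporting a new character set would mean adding reference embeddings to the index, with no sequence model to retrain.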
Quantifying Character Similarity with Vision Transformers
Yang, Xinmei, Arora, Abhishek, Jheng, Shao-Yu, Dell, Melissa
Record linkage is a bedrock of quantitative social science, as analyses often require linking data from multiple, noisy sources. Off-the-shelf string matching methods are widely used, as they are straightforward and cheap to implement and scale. Not all character substitutions are equally probable, and for some settings there are widely used handcrafted lists denoting which string substitutions are more likely, which improve the accuracy of string matching. However, such lists do not exist for many settings, skewing research with linked datasets towards a few high-resource contexts that are not representative of the diversity of human societies. This study develops an extensible way to measure character substitution costs for OCR'ed documents, by employing large-scale self-supervised training of vision transformers (ViT) with augmented digital fonts. For each language written with the CJK script, we contrastively learn a metric space where different augmentations of the same character are represented nearby. In this space, homoglyphic characters - those with similar appearance such as "O" and "0" - have similar vector representations. Using the cosine distance between characters' representations as the substitution cost in an edit distance matching algorithm significantly improves record linkage compared to other widely used string matching methods, as OCR errors tend to be homoglyphic in nature. Homoglyphs can plausibly capture character visual similarity across any script, including low-resource settings. We illustrate this by creating homoglyph sets for 3,000-year-old ancient Chinese characters, which are highly pictorial. Fascinatingly, a ViT is able to capture relationships in how different abstract concepts were conceptualized by ancient societies that have been noted in the archaeological literature.
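The matching step described above, edit distance with cosine distance as the substitution cost, can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: the two-dimensional vectors in `EMBEDDINGS` are made up for the example (the paper learns them with a ViT), insertions and deletions are assumed to cost 1, and characters without an embedding fall back to a substitution cost of 1.

```python
import math


def cosine_distance(a, b):
    """1 minus the cosine similarity of two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm


def substitution_cost(c1, c2, embeddings):
    """Cheap for visually similar characters, 1.0 when no embedding is known."""
    if c1 == c2:
        return 0.0
    if c1 in embeddings and c2 in embeddings:
        return cosine_distance(embeddings[c1], embeddings[c2])
    return 1.0


def weighted_edit_distance(s, t, embeddings):
    """Levenshtein distance with embedding-based substitution costs."""
    prev = [float(j) for j in range(len(t) + 1)]
    for i, cs in enumerate(s, 1):
        curr = [float(i)]
        for j, ct in enumerate(t, 1):
            curr.append(min(
                prev[j] + 1.0,       # delete cs
                curr[j - 1] + 1.0,   # insert ct
                prev[j - 1] + substitution_cost(cs, ct, embeddings),
            ))
        prev = curr
    return prev[-1]


# Hypothetical embeddings: "O" and "0" point in nearly the same direction,
# "X" is orthogonal to "O".
EMBEDDINGS = {"O": [1.0, 0.1], "0": [0.9, 0.2], "X": [-0.1, 1.0]}

homoglyph_error = weighted_edit_distance("BOB", "B0B", EMBEDDINGS)
unrelated_error = weighted_edit_distance("BOB", "BXB", EMBEDDINGS)
print(homoglyph_error < unrelated_error)
```

With these toy vectors, an OCR-style confusion of "O" with "0" is penalized far less than a substitution of "O" with "X", which is the behavior that lets homoglyphic errors be absorbed during record linkage.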
Optical Character Recognition using PaddleOCR
Reading huge documents can be tiring and time-consuming. You have probably seen software or applications where you just take a picture and get key information from the document. This is done by a technique called Optical Character Recognition (OCR), which has been one of the key research areas in AI in recent years. OCR is the process of recognizing text in an image and converting it into a machine-readable format by analyzing and understanding its underlying patterns. This blog post will focus on implementing and comparing various OCR algorithms provided by PaddleOCR using just a few lines of code. OCR can recognize handwritten text, printed text and text "in the wild". In short, OCR enables computers to read. But how does OCR work? OCR makes use of deep learning and computer vision techniques.
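As a rough idea of what "a few lines of code" can look like, here is a minimal sketch. It assumes the PaddleOCR 2.x Python API (`PaddleOCR(use_angle_cls=True, lang=...)` and `ocr.ocr(path, cls=True)`); newer releases have changed this interface, so check the documentation for the installed version. The helper names `extract_lines` and `run_ocr`, the confidence threshold, and the image path are hypothetical.

```python
def extract_lines(result, min_confidence=0.5):
    """Flatten a PaddleOCR 2.x style result into (text, confidence) pairs.

    Assumes `result` holds one entry per image, each a list of
    [box, (text, confidence)] items, or None when nothing was detected.
    """
    lines = []
    for page in result or []:
        for _box, (text, confidence) in page or []:
            if confidence >= min_confidence:
                lines.append((text, confidence))
    return lines


def run_ocr(image_path, lang="en"):
    """Run detection, angle classification, and recognition on one image."""
    # Imported lazily so extract_lines can be used without PaddleOCR installed.
    from paddleocr import PaddleOCR

    ocr = PaddleOCR(use_angle_cls=True, lang=lang)  # assumed 2.x constructor arguments
    return extract_lines(ocr.ocr(image_path, cls=True))


# Example call (hypothetical path):
# for text, confidence in run_ocr("document.jpg"):
#     print(f"{confidence:.2f}  {text}")
```

Filtering on the recognition confidence is a design choice of this sketch, not something PaddleOCR requires; it simply drops low-confidence lines before further processing.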
Baidu AI Research Brings A Significant Upgrade To PaddleOCR's Open-Source OCR System
A significant enhancement has been made to PaddleOCR, the multilingual optical character recognition (OCR) toolkit. With over 80 multi-language recognition models and an easy-to-use interface, PaddleOCR is an open-source OCR repository worth checking out. PP-OCRv3 has a 5% to 11% increase in accuracy in English and multilingual scenarios. Annotation functions for tables, irregular text images, and essential information extraction tasks have been added to PPOCRLabelv2. "Dive into OCR," a new interactive e-book, is now available.