AITopics | Optical Character Recognition

Optical Character Recognition

Our second example deals with a more challenging problem: the recognition of hand-printed letters of the alphabet. The characters that people print in the ordinary course of filling out forms and questionnaires are surprisingly varied. Gaps abound where continuous lines might be expected; curves and sharp angles appear interchangeably; there is almost every imaginable distortion of slant, shape and size. Even human readers cannot always identify such characters; their error rate is about 3 per cent on randomly selected letters and numbers, seen out of context.
– from Oliver G. Selfridge & Ulric Neisser. PATTERN RECOGNITION BY MACHINE. In Computers & thought, Edward A. Feigenbaum and Julian Feldman (Eds.). MIT Press, Cambridge, MA, USA, 1963. pp. 8-30.

Building a Luganda Text-to-Speech Model From Crowdsourced Data

Kagumire, Sulaiman, Katumba, Andrew, Nakatumba-Nabende, Joyce, Quinn, John

arXiv.org Artificial Intelligence May-16-2024

Text-to-speech (TTS) development for African languages such as Luganda is still limited, primarily due to the scarcity of high-quality, single-speaker recordings essential for training TTS models. Prior work has focused on utilizing the Luganda Common Voice recordings of multiple speakers aged between 20-49. Although the generated speech is intelligible, it is still of lower quality than the model trained on studio-grade recordings. This is due to the insufficient data preprocessing methods applied to improve the quality of the Common Voice recordings. Furthermore, speech convergence is more difficult to achieve due to varying intonations, as well as background noise. In this paper, we show that the quality of Luganda TTS from Common Voice can improve by training on multiple speakers of close intonation in addition to further preprocessing of the training data. Specifically, we selected six female speakers with close intonation determined by subjectively listening and comparing their voice recordings. In addition to trimming out silent portions from the beginning and end of the recordings, we applied a pre-trained speech enhancement model to reduce background noise and enhance audio quality. We also utilized a pre-trained, non-intrusive, self-supervised Mean Opinion Score (MOS) estimation model to filter recordings with an estimated MOS over 3.5, indicating high perceived quality. Subjective MOS evaluations from nine native Luganda speakers demonstrate that our TTS model achieves a significantly better MOS of 3.55 compared to the reported 2.5 MOS of the existing model. Moreover, for a fair comparison, our model trained on six speakers outperforms models trained on a single-speaker (3.13 MOS) or two speakers (3.22 MOS). This showcases the effectiveness of compensating for the lack of data from one speaker with data from multiple speakers of close intonation to improve TTS quality.

female speaker, intonation, multiple speaker, (12 more...)

arXiv.org Artificial Intelligence

2405.10211

Country:

Africa > Uganda > Central Region > Kampala (0.05)
Africa > East Africa (0.04)

Genre: Research Report (0.65)

Technology:

Information Technology > Artificial Intelligence > Speech > Speech Synthesis (0.73)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.69)
Information Technology > Artificial Intelligence > Vision > Optical Character Recognition (0.62)

Multi-Cell Decoder and Mutual Learning for Table Structure and Character Recognition

Kawakatsu, Takaya

arXiv.org Artificial Intelligence May-12-2024

Extracting table contents from documents such as scientific papers and financial reports and converting them into a format that can be processed by large language models is an important task in knowledge information processing. End-to-end approaches, which recognize not only table structure but also cell contents, achieved performance comparable to state-of-the-art models using external character recognition systems, and have potential for further improvements. In addition, these models can now recognize long tables with hundreds of cells by introducing local attention. However, the models recognize table structure in one direction from the header to the footer, and cell content recognition is performed independently for each cell, so there is no opportunity to retrieve useful information from the neighbor cells. In this paper, we propose a multi-cell content decoder and bidirectional mutual learning mechanism to improve the end-to-end approach. The effectiveness is demonstrated on two large datasets, and the experimental results show comparable performance to state-of-the-art models, even for long tables with large numbers of cells.

decoder, recognition, structure recognition, (10 more...)

arXiv.org Artificial Intelligence

2404.13268

Country: Asia > Japan > Honshū > Kantō > Tokyo Metropolis Prefecture > Tokyo (0.14)

Genre: Research Report > New Finding (0.34)

Industry: Banking & Finance (0.34)

Technology:

Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Pattern Recognition (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
Information Technology > Artificial Intelligence > Vision > Optical Character Recognition (0.84)

This AI-powered text-to-speech toolkit is now more than 150 off

PCWorld Apr-19-2024, 08:00:00 GMT

In the fast-paced modern working world, we're all looking for ways to save time and increase productivity. With the Jott Pro AI Text & Speech Toolkit, you can simplify workflows by processing text and recordings much faster. Now through 4/21, you can get it for more than 150 off. Jott Pro is a text and audio processor that allows you to transform spoken words into written text with extreme accuracy by leveraging its neural AI technology. Likewise, you can also convert text to lifelike speech, or translate text into almost any language you want.

ai text & speech toolkit, ai-powered text-to-speech toolkit, jott, (1 more...)

PCWorld

Technology:

Information Technology > Artificial Intelligence > Vision > Optical Character Recognition (0.40)
Information Technology > Artificial Intelligence > Speech > Speech Synthesis (0.40)
Information Technology > Artificial Intelligence > Assistive Technologies (0.40)

Making Old Kurdish Publications Processable by Augmenting Available Optical Character Recognition Engines

Yaseen, Blnd, Hassani, Hossein

arXiv.org Artificial Intelligence Apr-9-2024

Kurdish libraries hold many historical publications printed in the early days after printing devices were brought to Kurdistan. A good Optical Character Recognition (OCR) system would help process these publications and contribute to Kurdish language resources, which is crucial because Kurdish is considered a low-resource language. Current OCR systems are unable to extract text from historical documents because the documents have many issues: they are damaged, very fragile, carry many marks, and are often written in non-standard fonts. This is a massive obstacle, as processing these documents currently requires manual typing, which is very time-consuming. In this study, we adopt an open-source OCR framework by Google, Tesseract version 5.0, that has been used to extract text for various languages. Because no public dataset currently exists, we developed our own by collecting historical documents from the Zheen Center for Documentation and Research that were printed before 1950, resulting in a dataset of 1233 line images, each with a transcription. We then used the Arabic model as our base model and trained it on this dataset. We evaluated the model with different methods: Tesseract's built-in evaluator lstmeval indicated a Character Error Rate (CER) of 0.755%, and Ocreval demonstrated an average character accuracy of 84.02%. Finally, we developed a web application that provides an easy-to-use interface for end users, allowing them to input an image of a page and extract its text. An extensive dataset is crucial for developing OCR systems with reasonable accuracy, and the lack of public datasets for historical Kurdish documents posed a significant challenge in our work. The unaligned spaces between characters and words proved another challenge.
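
For readers unfamiliar with the Character Error Rate mentioned in the abstract above, the following is a minimal sketch of how CER is commonly defined: the character-level edit distance between a reference transcription and the OCR output, divided by the reference length. The function names `levenshtein` and `character_error_rate` are illustrative assumptions; this is not the implementation used by lstmeval or Ocreval, whose exact counting rules may differ.

```python
def levenshtein(ref: str, hyp: str) -> int:
    """Minimum number of single-character insertions, deletions, and substitutions."""
    prev = list(range(len(hyp) + 1))  # distances from the empty reference prefix
    for i, r in enumerate(ref, start=1):
        curr = [i]  # turning i reference characters into an empty hypothesis
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # delete r from the reference
                            curr[j - 1] + 1,      # insert h into the reference
                            prev[j - 1] + cost))  # substitute, or match at no cost
        prev = curr
    return prev[-1]


def character_error_rate(ref: str, hyp: str) -> float:
    """Edit distance normalised by reference length (an assumed, common definition)."""
    if not ref:
        raise ValueError("reference transcription must be non-empty")
    return levenshtein(ref, hyp) / len(ref)


if __name__ == "__main__":
    # One substituted character out of four reference characters.
    print(character_error_rate("abcd", "abxd"))  # 0.25
```

Because the denominator is the reference length, CER can exceed 1.0 when the OCR output contains many spurious insertions.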

dataset, historical document, recognition, (14 more...)

arXiv.org Artificial Intelligence

2404.06101

Country:

Asia > Middle East > Iraq > Baghdad Governorate > Baghdad (0.04)
Europe > Middle East > Republic of Türkiye > Istanbul Province > Istanbul (0.04)
Asia > Middle East > Republic of Türkiye > Istanbul Province > Istanbul (0.04)
(17 more...)

Genre: Research Report > New Finding (1.00)

Industry: Government (0.68)

Technology:

Information Technology > Artificial Intelligence > Vision > Optical Character Recognition (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

PEaCE: A Chemistry-Oriented Dataset for Optical Character Recognition on Scientific Documents

Zhang, Nan, Heaton, Connor, Okonsky, Sean Timothy, Mitra, Prasenjit, Toraman, Hilal Ezgi

arXiv.org Artificial Intelligence Mar-23-2024

Optical Character Recognition (OCR) is an established task with the objective of identifying the text present in an image. While many off-the-shelf OCR models exist, they are often trained for either scientific (e.g., formulae) or generic printed English text. Extracting text from chemistry publications requires an OCR model that is capable in both realms. Nougat, a recent tool, exhibits strong ability to parse academic documents, but is unable to parse tables in PubMed articles, which comprise a significant part of the academic community and are the focus of this work. To mitigate this gap, we present the Printed English and Chemical Equations (PEaCE) dataset, containing both synthetic and real-world records, and evaluate the efficacy of transformer-based OCR models when trained on this resource. Given that real-world records contain artifacts not present in synthetic records, we propose transformations that mimic such qualities. We perform a suite of experiments to explore the impact of patch size, multi-domain training, and our proposed transformations, ultimately finding that models with a small patch size trained on multiple domains using the proposed transformations yield the best performance. Our dataset and code are available at https://github.com/ZN1010/PEaCE.

dataset, patch size, real-world test, (15 more...)

arXiv.org Artificial Intelligence

2403.15724

Country:

North America > United States > Pennsylvania (0.05)
Europe > Czechia > South Moravian Region > Brno (0.04)
Europe > Switzerland > Vaud > Lausanne (0.04)
(4 more...)

Genre: Research Report (0.64)

Industry:

Government > Regional Government > North America Government > United States Government (0.68)
Materials > Chemicals (0.68)

Technology:

Information Technology > Artificial Intelligence > Vision > Optical Character Recognition (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Iterative Learning for Reliable Crowdsourcing Systems

Neural Information Processing Systems Mar-15-2024, 10:00:30 GMT

Crowdsourcing systems, in which tasks are electronically distributed to numerous "information piece-workers", have emerged as an effective paradigm for human-powered solving of large-scale problems in domains such as image classification, data entry, optical character recognition, recommendation, and proofreading. Because these low-paid workers can be unreliable, nearly all crowdsourcers must devise schemes to increase confidence in their answers, typically by assigning each task multiple times and combining the answers in some way such as majority voting. In this paper, we consider a general model of such crowdsourcing tasks, and pose the problem of minimizing the total price (i.e., number of task assignments) that must be paid to achieve a target overall reliability. We give a new algorithm for deciding which tasks to assign to which workers and for inferring correct answers from the workers' answers. We show that our algorithm significantly outperforms majority voting and, in fact, is asymptotically optimal through comparison to an oracle that knows the reliability of every worker.
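
The majority-voting baseline named in the abstract above can be sketched in a few lines. This is only the simple baseline the paper compares against, not the paper's iterative algorithm; the function name `majority_vote` and the `(task_id, worker_id, answer)` record layout are assumptions made for illustration.

```python
from collections import Counter, defaultdict


def majority_vote(labels):
    """Aggregate redundant answers per task by picking the most frequent one.

    `labels` is an iterable of (task_id, worker_id, answer) tuples; this layout
    is a hypothetical choice for the sketch. Ties go to the answer seen first.
    """
    answers_by_task = defaultdict(list)
    for task_id, _worker_id, answer in labels:
        answers_by_task[task_id].append(answer)
    return {
        task_id: Counter(answers).most_common(1)[0][0]
        for task_id, answers in answers_by_task.items()
    }


if __name__ == "__main__":
    # Each task is assigned to three workers; answers are +1 or -1.
    labels = [
        ("t1", "w1", 1), ("t1", "w2", 1), ("t1", "w3", -1),
        ("t2", "w1", -1), ("t2", "w2", -1), ("t2", "w3", 1),
    ]
    print(majority_vote(labels))  # {'t1': 1, 't2': -1}
```

Majority voting weights every worker equally, which is why a method that also estimates per-worker reliability can reach the same accuracy with fewer assignments.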

algorithm, iterative algorithm, reliability, (15 more...)

Neural Information Processing Systems

Country:

North America > United States > California > San Francisco County > San Francisco (0.14)
Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)
North America > United States > New York > New York County > New York City (0.04)
(2 more...)

Technology:

Information Technology > Communications > Social Media > Crowdsourcing (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning (1.00)
Information Technology > Artificial Intelligence > Vision > Optical Character Recognition (0.68)

Adversarial Training with OCR Modality Perturbation for Scene-Text Visual Question Answering

Shen, Zhixuan, Luo, Haonan, Li, Sijia, Li, Tianrui

arXiv.org Artificial Intelligence Mar-14-2024

Scene-Text Visual Question Answering (ST-VQA) aims to understand scene text in images and answer questions related to the text content. Most existing methods heavily rely on the accuracy of Optical Character Recognition (OCR) systems, and aggressive fine-tuning based on limited spatial location information and erroneous OCR text information often leads to inevitable overfitting. In this paper, we propose a multimodal adversarial training architecture with spatial awareness capabilities. Specifically, we introduce an Adversarial OCR Enhancement (AOE) module, which leverages adversarial ...

These methods typically involve large-scale pretraining followed by fine-tuning to adapt the model for question-answering tasks in text-rich scene images, often ignoring the inevitable OCR text recognition challenges. In practice, scene images may exhibit phenomena such as blurring, distortion, skewness, or uneven lighting, leading to erroneous character recognition by OCR systems, especially in cases of low-quality handwriting. Even when OCR systems correctly identify characters, discrete and semantically irrelevant recognition results may impact the comprehension of the OCR text semantics.

adversarial training, embedding, ocr, (14 more...)

arXiv.org Artificial Intelligence

2403.09288

Country: Asia > China > Sichuan Province (0.04)

Genre: Research Report (0.50)

Technology:

Information Technology > Artificial Intelligence > Vision > Optical Character Recognition (1.00)
Information Technology > Artificial Intelligence > Machine Learning (1.00)

Calibrated Structured Prediction

Neural Information Processing Systems Mar-12-2024, 23:31:06 GMT

In user-facing applications, displaying calibrated confidence measures-- probabilities that correspond to true frequency--can be as important as obtaining high accuracy. We are interested in calibration for structured prediction problems such as speech recognition, optical character recognition, and medical diagnosis. Structured prediction presents new challenges for calibration: the output space is large, and users may issue many types of probability queries (e.g., marginals) on the structured output. We extend the notion of calibration so as to handle various subtleties pertaining to the structured setting, and then provide a simple recalibration method that trains a binary classifier to predict probabilities of interest. We explore a range of features appropriate for structured recalibration, and demonstrate their efficacy on three real-world datasets.
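
The abstract above describes calibrated confidences as probabilities that correspond to true frequency. A minimal sketch of how one might check this is to bin predictions by confidence and compare each bin's mean predicted probability with the observed frequency of correct outcomes. The binned "expected calibration error" below is a common diagnostic and an assumption of this sketch, not necessarily the measure used in the paper; the function names and the default of five bins are illustrative.

```python
def calibration_table(probs, outcomes, n_bins=5):
    """Return (bin_index, count, mean_predicted_prob, observed_frequency) per non-empty bin."""
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(probs, outcomes):
        idx = min(int(p * n_bins), n_bins - 1)  # p == 1.0 falls into the last bin
        bins[idx].append((p, y))
    rows = []
    for idx, items in enumerate(bins):
        if not items:
            continue
        mean_p = sum(p for p, _ in items) / len(items)
        freq = sum(y for _, y in items) / len(items)
        rows.append((idx, len(items), mean_p, freq))
    return rows


def expected_calibration_error(probs, outcomes, n_bins=5):
    """Count-weighted average gap between predicted probability and observed frequency."""
    total = len(probs)
    if total == 0:
        raise ValueError("need at least one prediction")
    return sum(
        count / total * abs(mean_p - freq)
        for _, count, mean_p, freq in calibration_table(probs, outcomes, n_bins)
    )


if __name__ == "__main__":
    # Predictions of 0.5 that are right half the time are perfectly calibrated.
    print(expected_calibration_error([0.5, 0.5, 0.5, 0.5], [1, 0, 1, 0]))  # 0.0
```

In the structured setting the inputs would be probabilities for a chosen query, such as the marginal that a single recognized character is correct, paired with whether that query turned out to be true.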

calibration, probability, recalibration, (16 more...)

Neural Information Processing Systems

Country:

North America > Canada > Ontario > Toronto (0.14)
Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.14)
North America > United States > California > Santa Clara County > Stanford (0.04)
North America > United States > Massachusetts (0.04)

Industry: Health & Medicine (0.48)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Inductive Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Supervised Learning (0.94)
Information Technology > Artificial Intelligence > Vision > Optical Character Recognition (0.87)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Support Vector Machines (0.46)

How to save papers, photos, and analog music digitally

PCWorld Mar-8-2024, 14:00:00 GMT

Do you, like me, have paper documents that have long since been scanned and processed, records or music cassettes that you would like to listen to on your mobile phone, and photo prints that are planned for a digital photo book? Then you will appreciate the two-step instructions in this article, with which you can convert analog media to digital and then process them further. Important insurance papers, contracts, invoices, or simply the page-long letter from your favorite aunt -- there are many paper documents that you want to scan in order to preserve them. If it's even a text that you want to search and edit, you can run OCR software over it after scanning, which recognizes the text so that you can search it and, if necessary, edit it with a standard word processor. With the freeware Not Another PDF Scanner 2 (Naps 2), you have plenty of options for editing and saving the scan after scanning a document.

analog music digitally, folder, software, (13 more...)

PCWorld

Technology:

Information Technology > Communications > Mobile (0.91)
Information Technology > Artificial Intelligence > Vision > Optical Character Recognition (0.90)

LOCR: Location-Guided Transformer for Optical Character Recognition

Sun, Yu, Zhou, Dongzhan, Lin, Chen, He, Conghui, Ouyang, Wanli, Zhong, Han-Sen

arXiv.org Artificial Intelligence Mar-4-2024

Academic documents are packed with texts, equations, tables, and figures, requiring comprehensive understanding for accurate Optical Character Recognition (OCR). While end-to-end OCR methods offer improved accuracy over layout-based approaches, they often grapple with significant repetition issues, especially with complex layouts in Out-Of-Domain (OOD) documents. To tackle this issue, we propose LOCR, a model that integrates location guiding into the transformer architecture during autoregression. We train the model on a dataset comprising over 77M text-location pairs from 125K academic document pages, including bounding boxes for words, tables and mathematical symbols. LOCR adeptly handles various formatting elements and generates content in Markdown language. It outperforms all existing methods in our test set constructed from arXiv, as measured by edit distance, BLEU, METEOR and F-measure. LOCR also reduces repetition frequency from 4.4% of pages to 0.5% in the arXiv dataset, from 13.2% to 1.3% in OOD quantum physics documents and from 8.1% to 1.8% in OOD marketing documents. Additionally, LOCR features an interactive OCR mode, facilitating the generation of complex documents through a few location prompts from a human.

arxiv, locr, nougat, (16 more...)

arXiv.org Artificial Intelligence

2403.02127

Country:

Asia > China > Shanghai > Shanghai (0.05)
Europe > United Kingdom > England > Oxfordshire > Oxford (0.04)

Genre: Research Report (0.50)

Technology:

Information Technology > Artificial Intelligence > Vision > Optical Character Recognition (1.00)
Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.88)
