AITopics | Optical Character Recognition

Optical Character Recognition

Our second example deals with a more challenging problem: the recognition of hand-printed letters of the alphabet. The characters that people print in the ordinary course of filling out forms and questionnaires are surprisingly varied. Gaps abound where continuous lines might be expected; curves and sharp angles appear interchangeably; there is almost every imaginable distortion of slant, shape and size. Even human readers cannot always identify such characters; their error rate is about 3 per cent on randomly selected letters and numbers, seen out of context.
– from Oliver G. Selfridge & Ulric Neisser. PATTERN RECOGNITION BY MACHINE. In Computers and Thought, Edward A. Feigenbaum and Julian Feldman (Eds.). MIT Press, Cambridge, MA, USA, 1963. pp. 8-30.

Data Generation for Post-OCR correction of Cyrillic handwriting

Davydkin, Evgenii, Markelov, Aleksandr, Iuldashev, Egor, Dudkin, Anton, Krivorotov, Ivan

arXiv.org Artificial Intelligence, Nov-27-2023

This paper introduces a novel approach to post-Optical Character Recognition Correction (POC) for handwritten Cyrillic text, addressing a significant gap in current research methodologies. This gap is due to the lack of large text corpora that provide OCR errors for further training of language-based POC models, which are demanding in terms of corpus size. Our study primarily focuses on the development and application of a synthetic handwriting generation engine based on Bézier curves. Such an engine generates highly realistic handwritten text in any amount, which we utilize to create a substantial dataset by transforming Russian text corpora sourced from the internet. We apply a Handwritten Text Recognition (HTR) model to this dataset to identify OCR errors, forming the basis for our POC model training. The correction model is trained on a 90-symbol input context, utilizing a pre-trained T5 architecture with a seq2seq correction task. We evaluate our approach on HWR200 and School_notebooks_RU datasets as they provide significant challenges in the HTR domain. Furthermore, POC can be used to highlight errors for teachers, evaluating student performance. This can be done simply by comparing sentences before and after correction, displaying differences in text. Our primary contribution lies in the innovative use of Bézier curves for Cyrillic text generation and subsequent error correction using a specialized POC model. We validate our approach by presenting Word Accuracy Rate (WAR) and Character Accuracy Rate (CAR) results, both with and without post-OCR correction, using real open corpora of handwritten Cyrillic text. These results, coupled with our methodology, are designed to be reproducible, paving the way for further advancements in the field of OCR and handwritten text analysis. Paper contributions can be found at https://github.com/dbrainio/CyrillicHandwritingPOC
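
The abstract names Bézier curves as the basis of the handwriting generator but does not describe the engine itself. As a rough illustration only, the sketch below evaluates a single cubic Bézier stroke in its standard Bernstein form and perturbs the control points at random to produce a slightly different variant of the same stroke. The function names, the jitter scheme, and the example control points are all assumptions made for this sketch and are not taken from the paper or its repository.

```python
import random


def cubic_bezier(p0, p1, p2, p3, t):
    """Evaluate a cubic Bezier curve at parameter t in [0, 1] (Bernstein form)."""
    u = 1.0 - t
    b0, b1, b2, b3 = u ** 3, 3 * u * u * t, 3 * u * t * t, t ** 3
    x = b0 * p0[0] + b1 * p1[0] + b2 * p2[0] + b3 * p3[0]
    y = b0 * p0[1] + b1 * p1[1] + b2 * p2[1] + b3 * p3[1]
    return (x, y)


def sample_stroke(control_points, n=20):
    """Sample n points along the stroke defined by four control points."""
    p0, p1, p2, p3 = control_points
    return [cubic_bezier(p0, p1, p2, p3, i / (n - 1)) for i in range(n)]


def jitter(control_points, scale, rng):
    """Hypothetical variation step: shift each control point by a small random offset."""
    return [
        (x + rng.uniform(-scale, scale), y + rng.uniform(-scale, scale))
        for x, y in control_points
    ]


# Illustrative arch-shaped stroke; the coordinates are made up for this example.
stroke = [(0.0, 0.0), (0.0, 1.0), (1.0, 1.0), (1.0, 0.0)]
rng = random.Random(0)
clean = sample_stroke(stroke)
variant = sample_stroke(jitter(stroke, 0.05, rng))
```

Jittering control points rather than sampled points keeps each variant smooth, which is one plausible reason a curve-based generator can produce many distinct but legible renderings of the same glyph.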

augmentation, dataset, handwriting, (14 more...)

arXiv.org Artificial Intelligence

2311.15896

Genre: Research Report > Promising Solution (0.66)

Industry: Education (1.00)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Text Processing (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.93)
Information Technology > Artificial Intelligence > Vision > Handwriting Recognition (0.91)
Information Technology > Artificial Intelligence > Vision > Optical Character Recognition (0.86)

StyleTTS 2: Towards Human-Level Text-to-Speech through Style Diffusion and Adversarial Training with Large Speech Language Models

Li, Yinghao Aaron, Han, Cong, Raghavan, Vinay S., Mischler, Gavin, Mesgarani, Nima

arXiv.org Artificial Intelligence, Nov-19-2023

In this paper, we present StyleTTS 2, a text-to-speech (TTS) model that leverages style diffusion and adversarial training with large speech language models (SLMs) to achieve human-level TTS synthesis. StyleTTS 2 differs from its predecessor by modeling styles as a latent random variable through diffusion models to generate the most suitable style for the text without requiring reference speech, achieving efficient latent diffusion while benefiting from the diverse speech synthesis offered by diffusion models. Furthermore, we employ large pre-trained SLMs, such as WavLM, as discriminators with our novel differentiable duration modeling for end-to-end training, resulting in improved speech naturalness. StyleTTS 2 surpasses human recordings on the single-speaker LJSpeech dataset and matches them on the multispeaker VCTK dataset as judged by native English speakers. Moreover, when trained on the LibriTTS dataset, our model outperforms previous publicly available models for zero-shot speaker adaptation. This work achieves the first human-level TTS on both single and multispeaker datasets, showcasing the potential of style diffusion and adversarial training with large SLMs. The audio demos and source code are available at https://styletts2.github.io/.

discriminator, speech, styletts 2, (13 more...)

arXiv.org Artificial Intelligence

2306.07691

Country:

North America > United States (0.14)
Europe > Italy > Calabria > Catanzaro Province > Catanzaro (0.04)
Europe > Finland > Uusimaa > Helsinki (0.04)
Asia > Japan > Honshū > Kantō > Kanagawa Prefecture (0.04)

Genre:

Research Report > New Finding (1.00)
Research Report > Experimental Study (0.68)

Technology:

Information Technology > Artificial Intelligence > Speech > Speech Synthesis (1.00)
Information Technology > Artificial Intelligence > Speech > Speech Recognition (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
(2 more...)

A Study on Altering the Latent Space of Pretrained Text to Speech Models for Improved Expressiveness

Vogel, Mathias

arXiv.org Artificial Intelligence, Nov-17-2023

This report explores the challenge of enhancing expressiveness control in Text-to-Speech (TTS) models by augmenting a frozen pretrained model with a Diffusion Model that is conditioned on joint semantic audio/text embeddings. The paper identifies the challenges encountered when working with a VAE-based TTS model and evaluates different image-to-image methods for altering latent speech features. Our results offer valuable insights into the complexities of adding expressiveness control to TTS systems and open avenues for future research in this direction.

diffusion model, expressiveness, speech sample, (15 more...)

arXiv.org Artificial Intelligence

2311.10804

Country:

Europe > Switzerland > Zürich > Zürich (0.14)
Europe > Italy > Calabria > Catanzaro Province > Catanzaro (0.04)

Genre: Research Report > New Finding (0.48)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.94)
Information Technology > Artificial Intelligence > Speech > Speech Synthesis (0.72)
Information Technology > Artificial Intelligence > Vision > Optical Character Recognition (0.62)

Efficient End-to-End Visual Document Understanding with Rationale Distillation

Zhu, Wang, Agarwal, Alekh, Joshi, Mandar, Jia, Robin, Thomason, Jesse, Toutanova, Kristina

arXiv.org Artificial Intelligence, Nov-16-2023

Understanding visually situated language requires recognizing text and visual elements, and interpreting complex layouts. State-of-the-art methods commonly use specialized pre-processing tools, such as optical character recognition (OCR) systems, that map document image inputs to extracted information in the space of textual tokens, and sometimes also employ large language models (LLMs) to reason in text token space. However, the gains from external tools and LLMs come at the cost of increased computational and engineering complexity. In this paper, we ask whether small pretrained image-to-text models can learn selective text or layout recognition and reasoning as an intermediate inference step in an end-to-end model for pixel-level visual language understanding. We incorporate the outputs of such OCR tools, LLMs, and larger multimodal models as intermediate "rationales" on training data, and train a small student model to predict both rationales and answers for input questions based on those training examples. A student model based on Pix2Struct (282M parameters) achieves consistent improvements on three visual document understanding benchmarks representing infographics, scanned documents, and figures, with improvements of more than 4% absolute over a comparable Pix2Struct model that predicts answers directly.

large language model, machine learning, natural language, (18 more...)

arXiv.org Artificial Intelligence

2311.09612

Country:

Africa (0.46)
Europe > Russia (0.28)
Asia > Russia (0.28)
(5 more...)

Genre: Research Report > Promising Solution (0.34)

Industry:

Energy > Oil & Gas (0.93)
Education (0.88)
Government > Regional Government > North America Government > United States Government (0.46)
Materials > Chemicals > Commodity Chemicals > Petrochemicals (0.46)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning (1.00)
Information Technology > Artificial Intelligence > Vision > Optical Character Recognition (0.88)

Transduce and Speak: Neural Transducer for Text-to-Speech with Semantic Token Prediction

Kim, Minchan, Jeong, Myeonghun, Choi, Byoung Jin, Lee, Dongjune, Kim, Nam Soo

arXiv.org Artificial Intelligence, Nov-8-2023

We introduce a text-to-speech (TTS) framework based on a neural transducer. We use discretized semantic tokens acquired from wav2vec2.0 embeddings, which makes it easy to adopt a neural transducer for the TTS framework and to benefit from its monotonic alignment constraints. The proposed model first generates aligned semantic tokens using the neural transducer, then synthesizes a speech sample from the semantic tokens using a non-autoregressive (NAR) speech generator. This decoupled framework alleviates the training complexity of TTS and allows each stage to focus on 1) linguistic and alignment modeling and 2) fine-grained acoustic modeling, respectively. Experimental results on the zero-shot adaptive TTS show that the proposed model exceeds the baselines in speech quality and speaker similarity via objective and subjective measures. We also investigate the inference speed and prosody controllability of our proposed model, showing the potential of the neural transducer for TTS frameworks.

neural transducer, semantic token prediction, transduce and speak, (1 more...)

arXiv.org Artificial Intelligence

2311.02898

Genre: Research Report (0.40)

Technology:

Information Technology > Artificial Intelligence > Vision > Optical Character Recognition (0.60)
Information Technology > Artificial Intelligence > Speech > Speech Synthesis (0.60)
Information Technology > Artificial Intelligence > Assistive Technologies (0.60)

E3 TTS: Easy End-to-End Diffusion-based Text to Speech

Gao, Yuan, Morioka, Nobuyuki, Zhang, Yu, Chen, Nanxin

arXiv.org Artificial Intelligence, Nov-1-2023

We propose Easy End-to-End Diffusion-based Text to Speech, a simple and efficient end-to-end text-to-speech model based on diffusion. E3 TTS directly takes plain text as input and generates an audio waveform through an iterative refinement process. Unlike much prior work, E3 TTS does not rely on any intermediate representations like spectrogram features or alignment information. Instead, E3 TTS models the temporal structure of the waveform through the diffusion process. Without relying on additional conditioning information, E3 TTS could support flexible latent structure within the given audio. This enables E3 TTS to be easily adapted for zero-shot tasks such as editing without any additional training. Experiments show that E3 TTS can generate high-fidelity audio, approaching the performance of a state-of-the-art neural TTS system. Audio samples are available at https://e3tts.github.io.

information, international conference, waveform, (13 more...)

arXiv.org Artificial Intelligence

2311.00945

Country:

North America > United States (0.04)
Europe > Italy > Calabria > Catanzaro Province > Catanzaro (0.04)
Europe > Germany > Bavaria > Upper Bavaria > Munich (0.04)
Asia > Japan > Honshū > Tōhoku > Iwate Prefecture > Morioka (0.04)

Genre: Research Report (0.50)

Technology:

Information Technology > Artificial Intelligence > Speech > Speech Synthesis (1.00)
Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Vision > Optical Character Recognition (0.83)

The grassroots push to digitize India's most precious documents

MIT Technology Review, Oct-25-2023, 09:00:00 GMT

"Getting access to many of our public libraries is so difficult, and after a point people will give up asking for access. That's the case in many of our public-funded educational institutes too," says Arul George Scaria, an associate professor at the National Law School of India University Bengaluru, who studies intellectual-property law. One of the best ways to liberate access to these libraries, he says, is through digitization. Technologist Omshivaprakash H L felt the acute lack of such resources when he needed references for writing Wikipedia articles in Kannada, a southwestern Indian language. Around 2019, he heard that Carl Malamud, who runs Public Resource, a registered US charity, was already archiving books like Gandhi's Hind Swaraj collection on Indian self-rule and works of the Indian government in the public domain.

internet archive, omshivaprakash, precious document, (10 more...)

MIT Technology Review

Country: Asia > India > Karnataka > Bengaluru (0.28)

Industry:

Law > Intellectual Property & Technology Law (0.57)
Education > Educational Setting > Higher Education (0.57)
Education > Curriculum > Subject-Specific Education (0.57)

Technology:

Information Technology > Communications > Social Media (0.73)
Information Technology > Artificial Intelligence > Vision > Optical Character Recognition (0.33)

GenKIE: Robust Generative Multimodal Document Key Information Extraction

Cao, Panfeng, Wang, Ye, Zhang, Qiang, Meng, Zaiqiao

arXiv.org Artificial Intelligence, Oct-24-2023

Key information extraction (KIE) from scanned documents has gained increasing attention because of its applications in various domains. Although promising results have been achieved by some recent KIE approaches, they are usually built based on discriminative models, which lack the ability to handle optical character recognition (OCR) errors and require laborious token-level labelling. In this paper, we propose a novel generative end-to-end model, named GenKIE, to address the KIE task. GenKIE is a sequence-to-sequence multimodal generative model that utilizes multimodal encoders to embed visual, layout and textual features and a decoder to generate the desired output. Well-designed prompts are leveraged to incorporate the label semantics as the weakly supervised signals and entice the generation of the key information. One notable advantage of the generative model is that it enables automatic correction of OCR errors. Besides, token-level granular annotation is not required. Extensive experiments on multiple public real-world datasets show that GenKIE effectively generalizes over different types of documents and achieves state-of-the-art results. Our experiments also validate the model's robustness against OCR errors, making GenKIE highly applicable in real-world scenarios.

genkie, multimodal document key information extraction

arXiv.org Artificial Intelligence

2310.16131

Genre: Research Report (0.40)

Technology:

Information Technology > Data Science > Data Mining > Text Mining (0.60)
Information Technology > Artificial Intelligence > Natural Language > Information Extraction (0.60)
Information Technology > Artificial Intelligence > Vision > Optical Character Recognition (0.53)

DPP-TTS: Diversifying prosodic features of speech via determinantal point processes

Joo, Seongho, Koh, Hyukhun, Jung, Kyomin

arXiv.org Artificial Intelligence, Oct-23-2023

With the rapid advancement in deep generative models, recent neural Text-To-Speech (TTS) models have succeeded in synthesizing human-like speech. There have been some efforts to generate speech with various prosody beyond monotonous prosody patterns. However, previous works have several limitations. First, typical TTS models depend on the scaled sampling temperature for boosting the diversity of prosody. Speech samples generated at high sampling temperatures often lack perceptual prosodic diversity, which can adversely affect the naturalness of the speech. Second, the diversity among samples is neglected since the sampling procedure often focuses on a single speech sample rather than multiple ones. In this paper, we propose DPP-TTS: a text-to-speech model based on Determinantal Point Processes (DPPs) with a prosody diversifying module. Our TTS model is capable of generating speech samples that simultaneously consider perceptual diversity in each sample and among multiple samples. We demonstrate that DPP-TTS generates speech samples with more diversified prosody than baselines in the side-by-side comparison test considering the naturalness of speech at the same time.
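
To give a feel for why a determinantal point process favors diversity, the sketch below uses the standard L-ensemble formulation, in which the probability of selecting a subset is proportional to the determinant of the corresponding submatrix of a similarity kernel. The abstract does not say how DPP-TTS constructs its kernel, so the 3x3 kernel and the idea of three "prosody candidates" are purely illustrative assumptions.

```python
def determinant(matrix):
    """Determinant via Gaussian elimination with partial pivoting."""
    a = [row[:] for row in matrix]
    n = len(a)
    det = 1.0
    for col in range(n):
        # Pick the row with the largest absolute entry in this column as pivot.
        pivot = max(range(col, n), key=lambda r: abs(a[r][col]))
        if abs(a[pivot][col]) < 1e-12:
            return 0.0
        if pivot != col:
            a[col], a[pivot] = a[pivot], a[col]
            det = -det
        det *= a[col][col]
        for r in range(col + 1, n):
            factor = a[r][col] / a[col][col]
            for c in range(col, n):
                a[r][c] -= factor * a[col][c]
    return det


def subset_score(kernel, subset):
    """Unnormalized L-ensemble DPP score: det of the kernel restricted to subset."""
    sub = [[kernel[i][j] for j in subset] for i in subset]
    return determinant(sub)


# Hypothetical similarity kernel over three prosody candidates (values made up).
# Candidates 0 and 1 are nearly identical; candidate 2 differs from both.
kernel = [
    [1.0, 0.9, 0.1],
    [0.9, 1.0, 0.2],
    [0.1, 0.2, 1.0],
]

similar_pair = subset_score(kernel, [0, 1])  # about 1 - 0.9**2 = 0.19
diverse_pair = subset_score(kernel, [0, 2])  # about 1 - 0.1**2 = 0.99
```

For a pair with unit self-similarity and cross-similarity s, the score is 1 - s², so near-duplicate candidates are heavily penalized compared with dissimilar ones.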

predictor, prosodic feature, speech, (16 more...)

arXiv.org Artificial Intelligence

2310.14663

Country:

Asia > South Korea > Seoul > Seoul (0.05)
North America > United States (0.04)
Europe > Sweden > Östergötland County > Linköping (0.04)
(2 more...)

Genre: Research Report (0.82)

Technology:

Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.66)
Information Technology > Artificial Intelligence > Speech > Speech Synthesis (0.56)
Information Technology > Artificial Intelligence > Vision > Optical Character Recognition (0.55)

Dict-TTS: Learning to Pronounce with Prior Dictionary Knowledge for Text-to-Speech

Jiang, Ziyue, Su, Zhe, Zhao, Zhou, Yang, Qian, Ren, Yi, Liu, Jinglin, Ye, Zhenhui

arXiv.org Artificial Intelligence, Oct-19-2023

Polyphone disambiguation aims to capture accurate pronunciation knowledge from natural text sequences for reliable text-to-speech (TTS) systems. However, previous approaches require substantial annotated training data and additional efforts from language experts, making it difficult to extend high-quality neural TTS systems to out-of-domain daily conversations and countless languages worldwide. This paper tackles the polyphone disambiguation problem from a concise and novel perspective: we propose Dict-TTS, a semantic-aware generative text-to-speech model with an online website dictionary (the existing prior information in the natural language). Specifically, we design a semantics-to-pronunciation attention (S2PA) module to match the semantic patterns between the input text sequence and the prior semantics in the dictionary and obtain the corresponding pronunciations. The S2PA module can be easily trained with the end-to-end TTS model without any annotated phoneme labels. Experimental results in three languages show that our model outperforms several strong baseline models in terms of pronunciation accuracy and improves the prosody modeling of TTS systems. Further extensive analyses demonstrate that each design in Dict-TTS is effective.
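
The matching step described in the abstract can be pictured as attention between a context representation and the dictionary glosses of an ambiguous word. The toy sketch below is not the S2PA module: it assumes hand-written two-dimensional embeddings, plain dot-product scores, and a hard argmax at the end, all of which are simplifications chosen for illustration. The English word "lead" and its ARPAbet-style pronunciations are likewise only an example.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend_pronunciation(context_vec, entries):
    """Score each dictionary entry by dot product with the context and pick the best.

    entries is a list of (gloss_embedding, pronunciation) pairs.
    """
    scores = [
        sum(c * g for c, g in zip(context_vec, gloss)) for gloss, _ in entries
    ]
    weights = softmax(scores)
    best = max(range(len(entries)), key=lambda i: weights[i])
    return entries[best][1], weights


# Hypothetical dictionary entries for "lead" with made-up gloss embeddings:
# axis 0 stands for a "guide" sense, axis 1 for a "metal" sense.
lead_entries = [
    ([1.0, 0.0], "L IY D"),  # to guide
    ([0.0, 1.0], "L EH D"),  # the metal
]

# Made-up context embedding for a sentence such as "the pipe is made of lead".
context = [0.2, 0.9]
pronunciation, weights = attend_pronunciation(context, lead_entries)
```

In a trained model the attention weights would typically stay soft so that gradients flow back from the speech loss, which is consistent with the abstract's claim that no phoneme labels are needed; the argmax here only makes the toy output easy to read.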

dict-tts, polyphone disambiguation, pronunciation, (13 more...)

arXiv.org Artificial Intelligence

2206.02147

Country:

Europe > Czechia > South Moravian Region > Brno (0.04)
Asia > China (0.04)
North America > United States > Hawaii > Honolulu County > Honolulu (0.04)
(5 more...)

Genre: Research Report (0.82)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Text Processing (1.00)
Information Technology > Artificial Intelligence > Speech > Speech Synthesis (0.92)
Information Technology > Artificial Intelligence > Vision > Optical Character Recognition (0.91)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.46)
