AITopics | Optical Character Recognition

Collaborating Authors

Optical Character Recognition

Our second example deals with a more challenging problem: the recognition of hand-printed letters of the alphabet. The characters that people print in the ordinary course of filling out forms and questionnaires are surprisingly varied. Gaps abound wherecontinuous lines might be expected; curves and sharp angles appear interchangeably; there is almost every imaginable distortion of slant, shape and size. Even human readers cannot always identify such characters; their error rate is about 3 per cent on randomly selected letters and numbers, seen out of context.
– from Oliver G. Selfridge & Ulric Neisser. PATTERN RECOGNITION BY MACHINE . In Computers & thought, Edward A. Feigenbaum and Julian Feldman (Eds.). MIT Press, Cambridge, MA, USA, 1963. pp. 8-30.

News Overviews Instructional Materials AI-Alerts Classics

Rethinking MUSHRA: Addressing Modern Challenges in Text-to-Speech Evaluation

Varadhan, Praveen Srinivasa, Gulati, Amogh, Sankar, Ashwin, Anand, Srija, Gupta, Anirudh, Mukherjee, Anirudh, Marepally, Shiva Kumar, Bhatia, Ankur, Jaju, Saloni, Bhooshan, Suvrat, Khapra, Mitesh M.

arXiv.org Artificial IntelligenceDec-15-2024

Despite rapid advancements in TTS models, a consistent and robust human evaluation framework is still lacking. For example, MOS tests fail to differentiate between similar models, and CMOS's pairwise comparisons are time-intensive. The MUSHRA test is a promising alternative for evaluating multiple TTS systems simultaneously, but in this work we show that its reliance on matching human reference speech unduly penalises the scores of modern TTS systems that can exceed human speech quality. More specifically, we conduct a comprehensive assessment of the MUSHRA test, focusing on its sensitivity to factors such as rater variability, listener fatigue, and reference bias. Based on our extensive evaluation involving 492 human listeners across Hindi and Tamil we identify two primary shortcomings: (i) reference-matching bias, where raters are unduly influenced by the human reference, and (ii) judgement ambiguity, arising from a lack of clear fine-grained guidelines. To address these issues, we propose two refined variants of the MUSHRA test. The first variant enables fairer ratings for synthesized samples that surpass human reference quality. The second variant reduces ambiguity, as indicated by the relatively lower variance across raters. By combining these approaches, we achieve both more reliable and more fine-grained assessments. We also release MANGO, a massive dataset of 246,000 human ratings, the first-of-its-kind collection for Indian languages, aiding in analyzing human preferences and developing automatic metrics for evaluating TTS systems.

artificial intelligence, machine learning, natural language, (20 more...)

arXiv.org Artificial Intelligence

2411.12719

Country:

Asia > Japan > Honshū > Kantō > Tokyo Metropolis Prefecture > Tokyo (0.14)
North America > Dominican Republic (0.04)
Europe > Spain > Catalonia > Barcelona Province > Barcelona (0.04)
(4 more...)

Genre: Research Report > Experimental Study (0.93)

Technology:

Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Machine Learning (1.00)
Information Technology > Artificial Intelligence > Speech > Speech Synthesis (0.66)
Information Technology > Artificial Intelligence > Vision > Optical Character Recognition (0.42)

Add feedback

RoundTripOCR: A Data Generation Technique for Enhancing Post-OCR Error Correction in Low-Resource Devanagari Languages

Kashid, Harshvivek, Bhattacharyya, Pushpak

arXiv.org Artificial IntelligenceDec-14-2024

Optical Character Recognition (OCR) technology has revolutionized the digitization of printed text, enabling efficient data extraction and analysis across various domains. Just like Machine Translation systems, OCR systems are prone to errors. In this work, we address the challenge of data generation and post-OCR error correction, specifically for low-resource languages. We propose an approach for synthetic data generation for Devanagari languages, RoundTripOCR, that tackles the scarcity of the post-OCR Error Correction datasets for low-resource languages. We release post-OCR text correction datasets for Hindi, Marathi, Bodo, Nepali, Konkani and Sanskrit. We also present a novel approach for OCR error correction by leveraging techniques from machine translation. Our method involves translating erroneous OCR output into a corrected form by treating the OCR errors as mistranslations in a parallel text corpus, employing pre-trained transformer models to learn the mapping from erroneous to correct text pairs, effectively correcting OCR errors.

computational linguistic, machine learning, natural language, (21 more...)

arXiv.org Artificial Intelligence

2412.15248

Country:

North America > United States > Minnesota > Hennepin County > Minneapolis (0.14)
Europe > Ireland > Leinster > County Dublin > Dublin (0.04)
Oceania > Australia > Victoria > Melbourne (0.04)
(12 more...)

Genre: Research Report > Promising Solution (0.34)

Technology:

Information Technology > Data Science (1.00)
Information Technology > Artificial Intelligence > Vision > Optical Character Recognition (1.00)
Information Technology > Artificial Intelligence > Natural Language > Machine Translation (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.46)

Add feedback

A Preliminary Analysis of Automatic Word and Syllable Prominence Detection in Non-Native Speech With Text-to-Speech Prosody Embeddings

Mondal, Anindita, Bharadwaj, Rangavajjala Sankara, Mallela, Jhansi, Vuppala, Anil Kumar, Yarra, Chiranjeevi

arXiv.org Artificial IntelligenceDec-11-2024

Automatic detection of prominence at the word and syllable-levels is critical for building computer-assisted language learning systems. It has been shown that prosody embeddings learned by the current state-of-the-art (SOTA) text-to-speech (TTS) systems could generate word- and syllable-level prominence in the synthesized speech as natural as in native speech. To understand the effectiveness of prosody embeddings from TTS for prominence detection under nonnative context, a comparative analysis is conducted on the embeddings extracted from native and non-native speech considering the prominence-related embeddings: duration, energy, and pitch from a SOTA TTS named FastSpeech2. These embeddings are extracted under two conditions considering: 1) only text, 2) both speech and text. For the first condition, the embeddings are extracted directly from the TTS inference mode, whereas for the second condition, we propose to extract from the TTS under training mode. Experiments are conducted on native speech corpus: Tatoeba, and non-native speech corpus: ISLE. For experimentation, word-level prominence locations are manually annotated for both corpora. The highest relative improvement on word \& syllable-level prominence detection accuracies with the TTS embeddings are found to be 13.7% & 5.9% and 16.2% & 6.9% compared to those with the heuristic-based features and self-supervised Wav2Vec-2.0 representations, respectively.

artificial intelligence, machine learning, natural language, (20 more...)

arXiv.org Artificial Intelligence

2412.08283

Country:

Oceania > Australia > Australian Capital Territory > Canberra (0.04)
North America > Canada > Quebec > Montreal (0.04)
Asia > India > Telangana > Hyderabad (0.04)

Genre: Research Report (0.64)

Technology:

Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Speech > Speech Synthesis (0.72)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (0.69)
(2 more...)

Add feedback

Text Change Detection in Multilingual Documents Using Image Comparison

Park, Doyoung, Yarram, Naresh Reddy, Kim, Sunjin, Kim, Minkyu, Cho, Seongho, Lee, Taehee

arXiv.org Artificial IntelligenceDec-5-2024

Document comparison typically relies on optical character recognition (OCR) as its core technology. However, OCR requires the selection of appropriate language models for each document and the performance of multilingual or hybrid models remains limited. To overcome these challenges, we propose text change detection (TCD) using an image comparison model tailored for multilingual documents. Unlike OCR-based approaches, our method employs word-level text image-to-image comparison to detect changes. Our model generates bidirectional change segmentation maps between the source and target documents. To enhance performance without requiring explicit text alignment or scaling preprocessing, we employ correlations among multi-scale attention features. We also construct a benchmark dataset comprising actual printed and scanned word pairs in various languages to evaluate our model. We validate our approach using our benchmark dataset and public benchmarks Distorted Document Images and the LRDE Document Binarization Dataset. We compare our model against state-of-the-art semantic segmentation and change detection models, as well as to conventional OCR-based models.

dataset, detection, recognition, (15 more...)

arXiv.org Artificial Intelligence

2412.04137

Country:

South America > Chile > Santiago Metropolitan Region > Santiago Province > Santiago (0.04)
North America > United States > Nevada > Clark County > Las Vegas (0.04)
Europe > Germany > Bavaria > Upper Bavaria > Munich (0.04)

Genre: Research Report (1.00)

Industry: Health & Medicine (0.46)

Technology:

Information Technology > Sensing and Signal Processing > Image Processing (1.00)
Information Technology > Artificial Intelligence > Vision > Optical Character Recognition (1.00)
Information Technology > Artificial Intelligence > Natural Language (1.00)
(2 more...)

Add feedback

DiffStyleTTS: Diffusion-based Hierarchical Prosody Modeling for Text-to-Speech with Diverse and Controllable Styles

Liu, Jiaxuan, Liu, Zhaoci, Hu, Yajun, Gao, Yingying, Zhang, Shilei, Ling, Zhenhua

arXiv.org Artificial IntelligenceDec-4-2024

Human speech exhibits rich and flexible prosodic variations. To address the one-to-many mapping problem from text to prosody in a reasonable and flexible manner, we propose DiffStyleTTS, a multi-speaker acoustic model based on a conditional diffusion module and an improved classifier-free guidance, which hierarchically models speech prosodic features, and controls different prosodic styles to guide prosody prediction. Experiments show that our method outperforms all baselines in naturalness and achieves superior synthesis speed compared to three diffusion-based baselines. Additionally, by adjusting the guiding scale, DiffStyleTTS effectively controls the guidance intensity of the synthetic prosody.

artificial intelligence, machine learning, prosodic feature, (17 more...)

arXiv.org Artificial Intelligence

2412.03388

Country:

Asia > China > Anhui Province > Hefei (0.04)
Europe > Italy > Calabria > Catanzaro Province > Catanzaro (0.04)
Asia > China > Beijing > Beijing (0.04)

Genre: Research Report (1.00)

Technology:

Information Technology > Artificial Intelligence > Machine Learning (1.00)
Information Technology > Artificial Intelligence > Speech > Speech Synthesis (0.52)
Information Technology > Artificial Intelligence > Vision > Optical Character Recognition (0.41)

Add feedback

A comprehensive survey of oracle character recognition: challenges, benchmarks, and beyond

Li, Jing, Chi, Xueke, Wang, Qiufeng, Wang, Dahan, Huang, Kaizhu, Liu, Yongge, Liu, Cheng-lin

arXiv.org Artificial IntelligenceNov-18-2024

Oracle character recognition-an analysis of ancient Chinese inscriptions found on oracle bones-has become a pivotal field intersecting archaeology, paleography, and historical cultural studies. Traditional methods of oracle character recognition have relied heavily on manual interpretation by experts, which is not only labor-intensive but also limits broader accessibility to the general public. With recent breakthroughs in pattern recognition and deep learning, there is a growing movement towards the automation of oracle character recognition (OrCR), showing considerable promise in tackling the challenges inherent to these ancient scripts. However, a comprehensive understanding of OrCR still remains elusive. Therefore, this paper presents a systematic and structured survey of the current landscape of OrCR research. We commence by identifying and analyzing the key challenges of OrCR. Then, we provide an overview of the primary benchmark datasets and digital resources available for OrCR. A review of contemporary research methodologies follows, in which their respective efficacies, limitations, and applicability to the complex nature of oracle characters are critically highlighted and examined. Additionally, our review extends to ancillary tasks associated with OrCR across diverse disciplines, providing a broad-spectrum analysis of its applications. We conclude with a forward-looking perspective, proposing potential avenues for future investigations that could yield significant advancements in the field.

character recognition, oracle character, recognition, (16 more...)

arXiv.org Artificial Intelligence

2411.11354

Country:

Asia > China > Zhejiang Province > Ningbo (0.04)
Asia > China > Fujian Province > Xiamen (0.04)
Asia > China > Shaanxi Province > Xi'an (0.04)
Asia > China > Beijing > Beijing (0.04)

Genre: Overview (1.00)

Technology:

Information Technology > Artificial Intelligence > Vision > Optical Character Recognition (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Pattern Recognition (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Add feedback

EmoSphere++: Emotion-Controllable Zero-Shot Text-to-Speech via Emotion-Adaptive Spherical Vector

Cho, Deok-Hyeon, Oh, Hyung-Seok, Kim, Seung-Bin, Lee, Seong-Whan

arXiv.org Artificial IntelligenceNov-4-2024

Emotional text-to-speech (TTS) technology has achieved significant progress in recent years; however, challenges remain owing to the inherent complexity of emotions and limitations of the available emotional speech datasets and models. Previous studies typically relied on limited emotional speech datasets or required extensive manual annotations, restricting their ability to generalize across different speakers and emotional styles. In this paper, we present EmoSphere++, an emotion-controllable zero-shot TTS model that can control emotional style and intensity to resemble natural human speech. We introduce a novel emotion-adaptive spherical vector that models emotional style and intensity without human annotation. Moreover, we propose a multi-level style encoder that can ensure effective generalization for both seen and unseen speakers. We also introduce additional loss functions to enhance the emotion transfer performance for zero-shot scenarios. We employ a conditional flow matching-based decoder to achieve high-quality and expressive emotional TTS in a few sampling steps. Experimental results demonstrate the effectiveness of the proposed framework.

emotion, intensity, speech synthesis, (13 more...)

arXiv.org Artificial Intelligence

2411.02625

Country:

Europe > Italy > Calabria > Catanzaro Province > Catanzaro (0.04)
Asia > South Korea > Seoul > Seoul (0.04)

Genre: Research Report > New Finding (0.66)

Technology:

Information Technology > Artificial Intelligence > Cognitive Science > Emotion (0.94)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.82)
Information Technology > Artificial Intelligence > Speech > Speech Synthesis (0.73)
Information Technology > Artificial Intelligence > Vision > Optical Character Recognition (0.71)

Add feedback

Towards Visual Text Design Transfer Across Languages

Choi, Yejin, Chung, Jiwan, Shim, Sumin, Oh, Giyeong, Yu, Youngjae

arXiv.org Artificial IntelligenceOct-29-2024

Visual text design plays a critical role in conveying themes, emotions, and atmospheres in multimodal formats such as film posters and album covers. Translating these visual and textual elements across languages extends the concept of translation beyond mere text, requiring the adaptation of aesthetic and stylistic features. To address this, we introduce a novel task of Multimodal Style Translation (MuST-Bench), a benchmark designed to evaluate the ability of visual text generation models to perform translation across different writing systems while preserving design intent. Our initial experiments on MuST-Bench reveal that existing visual text generation models struggle with the proposed task due to the inadequacy of textual descriptions in conveying visual design. In response, we introduce SIGIL, a framework for multimodal style translation that eliminates the need for style descriptions. SIGIL enhances image generation models through three innovations: glyph latent for multilingual settings, pretrained VAEs for stable style guidance, and an OCR model with reinforcement learning feedback for optimizing readable character generation. SIGIL outperforms existing baselines by achieving superior style consistency and legibility while maintaining visual fidelity, setting itself apart from traditional description-based approaches. We release MuST-Bench publicly for broader use and exploration https://huggingface.co/datasets/yejinc/MuST-Bench.

evaluation, image 5, typography, (17 more...)

arXiv.org Artificial Intelligence

2410.18823

Country: Asia > Thailand (0.04)

Genre: Research Report > New Finding (0.46)

Industry:

Media > Film (1.00)
Leisure & Entertainment (0.93)

Technology:

Information Technology > Sensing and Signal Processing > Image Processing (1.00)
Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Vision > Optical Character Recognition (0.87)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.46)

Add feedback

Continuous Speech Tokenizer in Text To Speech

Li, Yixing, Xie, Ruobing, Sun, Xingwu, Cheng, Yu, Kang, Zhanhui

arXiv.org Artificial IntelligenceOct-22-2024

The fusion of speech and language in the era of large language models has garnered significant attention. Discrete speech token is often utilized in text-to-speech tasks for speech compression and portability, which is convenient for joint training with text and have good compression efficiency. However, we found that the discrete speech tokenizer still suffers from information loss. Therefore, we propose a simple yet effective continuous speech tokenizer and a text-to-speech model based on continuous speech tokens. Our results show that the speech language model based on the continuous speech tokenizer has better continuity and higher estimated Mean Opinion Scores (MoS). This enhancement is attributed to better information preservation rate of the continuous speech tokenizer across both low and high frequencies in the frequency domain.

large language model, machine learning, natural language, (16 more...)

arXiv.org Artificial Intelligence

2410.17081

Country:

Asia > Middle East > Israel (0.04)
Asia > China > Hong Kong (0.04)

Genre: Research Report > New Finding (0.54)

Technology:

Information Technology > Artificial Intelligence > Speech > Speech Synthesis (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.89)
Information Technology > Artificial Intelligence > Vision > Optical Character Recognition (0.83)
(2 more...)

Add feedback

Robust Loop Closure by Textual Cues in Challenging Environments

Jin, Tongxing, Nguyen, Thien-Minh, Xu, Xinhang, Yang, Yizhuo, Yuan, Shenghai, Li, Jianping, Xie, Lihua

arXiv.org Artificial IntelligenceOct-21-2024

Loop closure is an important task in robot navigation. However, existing methods mostly rely on some implicit or heuristic features of the environment, which can still fail to work in common environments such as corridors, tunnels, and warehouses. Indeed, navigating in such featureless, degenerative, and repetitive (FDR) environments would also pose a significant challenge even for humans, but explicit text cues in the surroundings often provide the best assistance. This inspires us to propose a multi-modal loop closure method based on explicit human-readable textual cues in FDR environments. Specifically, our approach first extracts scene text entities based on Optical Character Recognition (OCR), then creates a local map of text cues based on accurate LiDAR odometry and finally identifies loop closure events by a graph-theoretic scheme. Experiment results demonstrate that this approach has superior performance over existing methods that rely solely on visual and LiDAR sensors. To benefit the community, we release the source code and datasets at \url{https://github.com/TongxingJin/TXTLCD}.

artificial intelligence, machine learning, pattern recognition, (17 more...)

arXiv.org Artificial Intelligence

2410.15869

Country:

Asia > Singapore (0.04)
Europe > Greece (0.04)

Genre: Research Report > New Finding (0.34)

Technology:

Information Technology > Artificial Intelligence > Vision > Optical Character Recognition (0.54)
Information Technology > Artificial Intelligence > Machine Learning > Pattern Recognition (0.48)

Add feedback