Optical Character Recognition
Computer Vision Intelligence Test Modeling and Generation: A Case Study on Smart OCR
Shu, Jing, Miu, Bing-Jiun, Chang, Eugene, Gao, Jerry, Liu, Jun
AI-based systems possess distinctive characteristics and introduce challenges in quality evaluation at the same time. Consequently, ensuring and validating AI software quality is of critical importance. In this paper, we present an effective AI software functional testing model to address this challenge. Specifically, we first present a comprehensive literature review of previous work, covering key facets of AI software testing processes. We then introduce a 3D classification model to systematically evaluate the image-based text extraction AI function, as well as test coverage criteria and complexity. To evaluate the performance of our proposed AI software quality test, we propose four evaluation metrics to cover different aspects. Finally, based on the proposed framework and defined metrics, a mobile Optical Character Recognition (OCR) case study is presented to demonstrate the framework's effectiveness and capability in assessing AI function quality.
Zero-Shot Text-to-Speech as Golden Speech Generator: A Systematic Framework and its Applicability in Automatic Pronunciation Assessment
Lo, Tien-Hong, Tsai, Meng-Ting, Chen, Berlin
Second language (L2) learners can improve their pronunciation by imitating golden speech, especially when the speech that aligns with their respective speech characteristics. This study explores the hypothesis that learner-specific golden speech generated with zero-shot text-to-speech (ZS-TTS) techniques can be harnessed as an effective metric for measuring the pronunciation proficiency of L2 learners. Building on this exploration, the contributions of this study are at least two-fold: 1) design and development of a systematic framework for assessing the ability of a synthesis model to generate golden speech, and 2) in-depth investigations of the effectiveness of using golden speech in automatic pronunciation assessment (APA). Comprehensive experiments conducted on the L2-ARCTIC and Speechocean762 benchmark datasets suggest that our proposed modeling can yield significant performance improvements with respect to various assessment metrics in relation to some prior arts. To our knowledge, this study is the first to explore the role of golden speech in both ZS-TTS and APA, offering a promising regime for computer-assisted pronunciation training (CAPT).
Cross-Dialect Text-To-Speech in Pitch-Accent Language Incorporating Multi-Dialect Phoneme-Level BERT
Yamauchi, Kazuki, Saito, Yuki, Saruwatari, Hiroshi
We explore cross-dialect text-to-speech (CD-TTS), a task to synthesize learned speakers' voices in non-native dialects, especially in pitch-accent languages. CD-TTS is important for developing voice agents that naturally communicate with people across regions. We present a novel TTS model comprising three sub-modules to perform competitively at this task. We first train a backbone TTS model to synthesize dialect speech from a text conditioned on phoneme-level accent latent variables (ALVs) extracted from speech by a reference encoder. Then, we train an ALV predictor to predict ALVs tailored to a target dialect from input text leveraging our novel multi-dialect phoneme-level BERT. We conduct multi-dialect TTS experiments and evaluate the effectiveness of our model by comparing it with a baseline derived from conventional dialect TTS methods. The results show that our model improves the dialectal naturalness of synthetic speech in CD-TTS.
Confidence-Aware Document OCR Error Detection
Hemmer, Arthur, Coustaty, Mickaël, Bartolo, Nicola, Ogier, Jean-Marc
Optical Character Recognition (OCR) continues to face accuracy challenges that impact subsequent applications. To address these errors, we explore the utility of OCR confidence scores for enhancing post-OCR error detection. Our study involves analyzing the correlation between confidence scores and error rates across different OCR systems. We develop ConfBERT, a BERT-based model that incorporates OCR confidence scores into token embeddings and offers an optional pre-training phase for noise adjustment. Our experimental results demonstrate that integrating OCR confidence scores can enhance error detection capabilities. This work underscores the importance of OCR confidence scores in improving detection accuracy and reveals substantial disparities in performance between commercial and open-source OCR technologies.
A multilingual training strategy for low resource Text to Speech
Amalas, Asma, Ghogho, Mounir, Chetouani, Mohamed, Thami, Rachid Oulad Haj
Recent speech technologies have led to produce high quality synthesised speech due to recent advances in neural Text to Speech (TTS). However, such TTS models depend on extensive amounts of data that can be costly to produce and is hardly scalable to all existing languages, especially that seldom attention is given to low resource languages. With techniques such as knowledge transfer, the burden of creating datasets can be alleviated. In this paper, we therefore investigate two aspects; firstly, whether data from social media can be used for a small TTS dataset construction, and secondly whether cross lingual transfer learning (TL) for a low resource language can work with this type of data. In this aspect, we specifically assess to what extent multilingual modeling can be leveraged as an alternative to training on monolingual corporas. To do so, we explore how data from foreign languages may be selected and pooled to train a TTS model for a target low resource language. Our findings show that multilingual pre-training is better than monolingual pre-training at increasing the intelligibility and naturalness of the generated speech.
A Permuted Autoregressive Approach to Word-Level Recognition for Urdu Digital Text
Mustafa, Ahmed, Rafique, Muhammad Tahir, Baig, Muhammad Ijlal, Sajid, Hasan, Khan, Muhammad Jawad, Kallu, Karam Dad
This research paper introduces a novel word-level Optical Character Recognition (OCR) model specifically designed for digital Urdu text, leveraging transformer-based architectures and attention mechanisms to address the distinct challenges of Urdu script recognition, including its diverse text styles, fonts, and variations. The model employs a permuted autoregressive sequence (PARSeq) architecture, which enhances its performance by enabling context-aware inference and iterative refinement through the training of multiple token permutations. This method allows the model to adeptly manage character reordering and overlapping characters, commonly encountered in Urdu script. Trained on a dataset comprising approximately 160,000 Urdu text images, the model demonstrates a high level of accuracy in capturing the intricacies of Urdu script, achieving a CER of 0.178. Despite ongoing challenges in handling certain text variations, the model exhibits superior accuracy and effectiveness in practical applications. Future work will focus on refining the model through advanced data augmentation techniques and the integration of context-aware language models to further enhance its performance and robustness in Urdu text recognition.
SimpleSpeech 2: Towards Simple and Efficient Text-to-Speech with Flow-based Scalar Latent Transformer Diffusion Models
Yang, Dongchao, Huang, Rongjie, Wang, Yuanyuan, Guo, Haohan, Chong, Dading, Liu, Songxiang, Wu, Xixin, Meng, Helen
Scaling Text-to-speech (TTS) to large-scale datasets has been demonstrated as an effective method for improving the diversity and naturalness of synthesized speech. At the high level, previous large-scale TTS models can be categorized into either Auto-regressive (AR) based (\textit{e.g.}, VALL-E) or Non-auto-regressive (NAR) based models (\textit{e.g.}, NaturalSpeech 2/3). Although these works demonstrate good performance, they still have potential weaknesses. For instance, AR-based models are plagued by unstable generation quality and slow generation speed; meanwhile, some NAR-based models need phoneme-level duration alignment information, thereby increasing the complexity of data pre-processing, model design, and loss design. In this work, we build upon our previous publication by implementing a simple and efficient non-autoregressive (NAR) TTS framework, termed SimpleSpeech 2. SimpleSpeech 2 effectively combines the strengths of both autoregressive (AR) and non-autoregressive (NAR) methods, offering the following key advantages: (1) simplified data preparation; (2) straightforward model and loss design; and (3) stable, high-quality generation performance with fast inference speed. Compared to our previous publication, we present ({\romannumeral1}) a detailed analysis of the influence of speech tokenizer and noisy label for TTS performance; ({\romannumeral2}) four distinct types of sentence duration predictors; ({\romannumeral3}) a novel flow-based scalar latent transformer diffusion model. With these improvement, we show a significant improvement in generation performance and generation speed compared to our previous work and other state-of-the-art (SOTA) large-scale TTS models. Furthermore, we show that SimpleSpeech 2 can be seamlessly extended to multilingual TTS by training it on multilingual speech datasets. Demos are available on: {https://dongchaoyang.top/SimpleSpeech2\_demo/}.
Ancient but Digitized: Developing Handwritten Optical Character Recognition for East Syriac Script Through Creating KHAMIS Dataset
Majeed, Ameer, Hassani, Hossein
Many languages have vast amounts of handwritten texts, such as ancient scripts about folktale stories and historical narratives or contemporary documents and letters. Digitization of those texts has various applications, such as daily tasks, cultural studies, and historical research. Syriac is an ancient, endangered, and low-resourced language that has not received the attention it requires and deserves. This paper reports on a research project aimed at developing a optical character recognition (OCR) model based on the handwritten Syriac texts as a starting point to build more digital services for this endangered language. A dataset was created, KHAMIS (inspired by the East Syriac poet, Khamis bar Qardahe), which consists of handwritten sentences in the East Syriac script. We used it to fine-tune the Tesseract-OCR engine's pretrained Syriac model on handwritten data. The data was collected from volunteers capable of reading and writing in the language to create KHAMIS. KHAMIS currently consists of 624 handwritten Syriac sentences collected from 31 university students and one professor, and it will be partially available online and the whole dataset available in the near future for development and research purposes. As a result, the handwritten OCR model was able to achieve a character error rate of 1.097-1.610% and 8.963-10.490% on both training and evaluation sets, respectively, and both a character error rate of 18.89-19.71% and a word error rate of 62.83-65.42% when evaluated on the test set, which is twice as better than the default Syriac model of Tesseract.
Empirical Analysis of Large Vision-Language Models against Goal Hijacking via Visual Prompt Injection
Kimura, Subaru, Tanaka, Ryota, Miyawaki, Shumpei, Suzuki, Jun, Sakaguchi, Keisuke
We explore visual prompt injection (VPI) that maliciously exploits the ability of large vision-language models (LVLMs) to follow instructions drawn onto the input image. We propose a new VPI method, "goal hijacking via visual prompt injection" (GHVPI), that swaps the execution task of LVLMs from an original task to an alternative task designated by an attacker. The quantitative analysis indicates that GPT-4V is vulnerable to the GHVPI and demonstrates a notable attack success rate of 15.8%, which is an unignorable security risk. Our analysis also shows that successful GHVPI requires high character recognition capability and instruction-following ability in LVLMs.
Handwritten Code Recognition for Pen-and-Paper CS Education
Islam, Md Sazzad, Doumbouya, Moussa Koulako Bala, Manning, Christopher D., Piech, Chris
Teaching Computer Science (CS) by having students write programs by hand on paper has key pedagogical advantages: It allows focused learning and requires careful thinking compared to the use of Integrated Development Environments (IDEs) with intelligent support tools or "just trying things out". The familiar environment of pens and paper also lessens the cognitive load of students with no prior experience with computers, for whom the mere basic usage of computers can be intimidating. Finally, this teaching approach opens learning opportunities to students with limited access to computers. However, a key obstacle is the current lack of teaching methods and support software for working with and running handwritten programs. Optical character recognition (OCR) of handwritten code is challenging: Minor OCR errors, perhaps due to varied handwriting styles, easily make code not run, and recognizing indentation is crucial for languages like Python but is difficult to do due to inconsistent horizontal spacing in handwriting. Our approach integrates two innovative methods. The first combines OCR with an indentation recognition module and a language model designed for post-OCR error correction without introducing hallucinations. This method, to our knowledge, surpasses all existing systems in handwritten code recognition. It reduces error from 30\% in the state of the art to 5\% with minimal hallucination of logical fixes to student programs. The second method leverages a multimodal language model to recognize handwritten programs in an end-to-end fashion. We hope this contribution can stimulate further pedagogical research and contribute to the goal of making CS education universally accessible. We release a dataset of handwritten programs and code to support future research at https://github.com/mdoumbouya/codeocr