AITopics | Optical Character Recognition

Collaborating Authors

Optical Character Recognition

Our second example deals with a more challenging problem: the recognition of hand-printed letters of the alphabet. The characters that people print in the ordinary course of filling out forms and questionnaires are surprisingly varied. Gaps abound wherecontinuous lines might be expected; curves and sharp angles appear interchangeably; there is almost every imaginable distortion of slant, shape and size. Even human readers cannot always identify such characters; their error rate is about 3 per cent on randomly selected letters and numbers, seen out of context.
– from Oliver G. Selfridge & Ulric Neisser. PATTERN RECOGNITION BY MACHINE . In Computers & thought, Edward A. Feigenbaum and Julian Feldman (Eds.). MIT Press, Cambridge, MA, USA, 1963. pp. 8-30.

News Overviews Instructional Materials AI-Alerts Classics

A Permuted Autoregressive Approach to Word-Level Recognition for Urdu Digital Text

Mustafa, Ahmed, Rafique, Muhammad Tahir, Baig, Muhammad Ijlal, Sajid, Hasan, Khan, Muhammad Jawad, Kallu, Karam Dad

arXiv.org Artificial IntelligenceAug-30-2024

This research paper introduces a novel word-level Optical Character Recognition (OCR) model specifically designed for digital Urdu text, leveraging transformer-based architectures and attention mechanisms to address the distinct challenges of Urdu script recognition, including its diverse text styles, fonts, and variations. The model employs a permuted autoregressive sequence (PARSeq) architecture, which enhances its performance by enabling context-aware inference and iterative refinement through the training of multiple token permutations. This method allows the model to adeptly manage character reordering and overlapping characters, commonly encountered in Urdu script. Trained on a dataset comprising approximately 160,000 Urdu text images, the model demonstrates a high level of accuracy in capturing the intricacies of Urdu script, achieving a CER of 0.178. Despite ongoing challenges in handling certain text variations, the model exhibits superior accuracy and effectiveness in practical applications. Future work will focus on refining the model through advanced data augmentation techniques and the integration of context-aware language models to further enhance its performance and robustness in Urdu text recognition.

machine learning, pattern recognition, recognition, (20 more...)

arXiv.org Artificial Intelligence

2408.15119

Country:

Asia > Pakistan > Islamabad Capital Territory > Islamabad (0.06)
Europe > Germany > Hesse > Darmstadt Region > Darmstadt (0.04)

Genre:

Research Report (1.00)
Instructional Material > Online (0.41)
Instructional Material > Course Syllabus & Notes (0.41)

Technology:

Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Pattern Recognition (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
Information Technology > Artificial Intelligence > Vision > Optical Character Recognition (0.91)

Add feedback

SimpleSpeech 2: Towards Simple and Efficient Text-to-Speech with Flow-based Scalar Latent Transformer Diffusion Models

Yang, Dongchao, Huang, Rongjie, Wang, Yuanyuan, Guo, Haohan, Chong, Dading, Liu, Songxiang, Wu, Xixin, Meng, Helen

arXiv.org Artificial IntelligenceAug-28-2024

Scaling Text-to-speech (TTS) to large-scale datasets has been demonstrated as an effective method for improving the diversity and naturalness of synthesized speech. At the high level, previous large-scale TTS models can be categorized into either Auto-regressive (AR) based (\textit{e.g.}, VALL-E) or Non-auto-regressive (NAR) based models (\textit{e.g.}, NaturalSpeech 2/3). Although these works demonstrate good performance, they still have potential weaknesses. For instance, AR-based models are plagued by unstable generation quality and slow generation speed; meanwhile, some NAR-based models need phoneme-level duration alignment information, thereby increasing the complexity of data pre-processing, model design, and loss design. In this work, we build upon our previous publication by implementing a simple and efficient non-autoregressive (NAR) TTS framework, termed SimpleSpeech 2. SimpleSpeech 2 effectively combines the strengths of both autoregressive (AR) and non-autoregressive (NAR) methods, offering the following key advantages: (1) simplified data preparation; (2) straightforward model and loss design; and (3) stable, high-quality generation performance with fast inference speed. Compared to our previous publication, we present ({\romannumeral1}) a detailed analysis of the influence of speech tokenizer and noisy label for TTS performance; ({\romannumeral2}) four distinct types of sentence duration predictors; ({\romannumeral3}) a novel flow-based scalar latent transformer diffusion model. With these improvement, we show a significant improvement in generation performance and generation speed compared to our previous work and other state-of-the-art (SOTA) large-scale TTS models. Furthermore, we show that SimpleSpeech 2 can be seamlessly extended to multilingual TTS by training it on multilingual speech datasets. Demos are available on: {https://dongchaoyang.top/SimpleSpeech2\_demo/}.

arxiv preprint arxiv, diffusion model, simplespeech 2, (13 more...)

arXiv.org Artificial Intelligence

2408.13893

Country:

Europe > Italy > Calabria > Catanzaro Province > Catanzaro (0.04)
Asia > Japan > Honshū > Tōhoku > Iwate Prefecture > Morioka (0.04)
Asia > China > Hong Kong (0.04)

Genre: Research Report > New Finding (1.00)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
(3 more...)

Add feedback

Ancient but Digitized: Developing Handwritten Optical Character Recognition for East Syriac Script Through Creating KHAMIS Dataset

Majeed, Ameer, Hassani, Hossein

arXiv.org Artificial IntelligenceAug-24-2024

Many languages have vast amounts of handwritten texts, such as ancient scripts about folktale stories and historical narratives or contemporary documents and letters. Digitization of those texts has various applications, such as daily tasks, cultural studies, and historical research. Syriac is an ancient, endangered, and low-resourced language that has not received the attention it requires and deserves. This paper reports on a research project aimed at developing a optical character recognition (OCR) model based on the handwritten Syriac texts as a starting point to build more digital services for this endangered language. A dataset was created, KHAMIS (inspired by the East Syriac poet, Khamis bar Qardahe), which consists of handwritten sentences in the East Syriac script. We used it to fine-tune the Tesseract-OCR engine's pretrained Syriac model on handwritten data. The data was collected from volunteers capable of reading and writing in the language to create KHAMIS. KHAMIS currently consists of 624 handwritten Syriac sentences collected from 31 university students and one professor, and it will be partially available online and the whole dataset available in the near future for development and research purposes. As a result, the handwritten OCR model was able to achieve a character error rate of 1.097-1.610% and 8.963-10.490% on both training and evaluation sets, respectively, and both a character error rate of 18.89-19.71% and a word error rate of 62.83-65.42% when evaluated on the test set, which is twice as better than the default Syriac model of Tesseract.

dataset, recognition, syriac, (14 more...)

arXiv.org Artificial Intelligence

2408.13631

Country:

Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.14)
Asia > Middle East > Iraq > Kurdistan Region > Duhok Governorate > Duhok (0.05)
Asia > Middle East > Iraq > Erbil Governorate > Erbil (0.05)
(11 more...)

Genre: Research Report (1.00)

Industry:

Government (0.93)
Education > Educational Setting > Higher Education (0.39)

Technology:

Information Technology > Artificial Intelligence > Vision > Optical Character Recognition (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.94)
Information Technology > Artificial Intelligence > Machine Learning > Pattern Recognition (0.85)

Add feedback

Empirical Analysis of Large Vision-Language Models against Goal Hijacking via Visual Prompt Injection

Kimura, Subaru, Tanaka, Ryota, Miyawaki, Shumpei, Suzuki, Jun, Sakaguchi, Keisuke

arXiv.org Artificial IntelligenceAug-7-2024

We explore visual prompt injection (VPI) that maliciously exploits the ability of large vision-language models (LVLMs) to follow instructions drawn onto the input image. We propose a new VPI method, "goal hijacking via visual prompt injection" (GHVPI), that swaps the execution task of LVLMs from an original task to an alternative task designated by an attacker. The quantitative analysis indicates that GPT-4V is vulnerable to the GHVPI and demonstrates a notable attack success rate of 15.8%, which is an unignorable security risk. Our analysis also shows that successful GHVPI requires high character recognition capability and instruction-following ability in LVLMs.

instruction, lvlm, prompt injection, (15 more...)

arXiv.org Artificial Intelligence

2408.03554

Country: Asia > Japan > Honshū > Tōhoku (0.05)

Genre: Research Report > New Finding (0.68)

Industry:

Information Technology (0.69)
Law Enforcement & Public Safety > Terrorism (0.66)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.71)
Information Technology > Artificial Intelligence > Vision > Optical Character Recognition (0.50)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.30)

Add feedback

Handwritten Code Recognition for Pen-and-Paper CS Education

Islam, Md Sazzad, Doumbouya, Moussa Koulako Bala, Manning, Christopher D., Piech, Chris

arXiv.org Artificial IntelligenceAug-7-2024

Teaching Computer Science (CS) by having students write programs by hand on paper has key pedagogical advantages: It allows focused learning and requires careful thinking compared to the use of Integrated Development Environments (IDEs) with intelligent support tools or "just trying things out". The familiar environment of pens and paper also lessens the cognitive load of students with no prior experience with computers, for whom the mere basic usage of computers can be intimidating. Finally, this teaching approach opens learning opportunities to students with limited access to computers. However, a key obstacle is the current lack of teaching methods and support software for working with and running handwritten programs. Optical character recognition (OCR) of handwritten code is challenging: Minor OCR errors, perhaps due to varied handwriting styles, easily make code not run, and recognizing indentation is crucial for languages like Python but is difficult to do due to inconsistent horizontal spacing in handwriting. Our approach integrates two innovative methods. The first combines OCR with an indentation recognition module and a language model designed for post-OCR error correction without introducing hallucinations. This method, to our knowledge, surpasses all existing systems in handwritten code recognition. It reduces error from 30\% in the state of the art to 5\% with minimal hallucination of logical fixes to student programs. The second method leverages a multimodal language model to recognize handwritten programs in an end-to-end fashion. We hope this contribution can stimulate further pedagogical research and contribute to the goal of making CS education universally accessible. We release a dataset of handwritten programs and code to support future research at https://github.com/mdoumbouya/codeocr

handwritten code recognition, pen-and-paper cs education

arXiv.org Artificial Intelligence

2408.0722

Genre: Research Report (1.00)

Industry: Education > Curriculum > Subject-Specific Education (0.53)

Technology: Information Technology > Artificial Intelligence > Vision > Optical Character Recognition (0.53)

Add feedback

Focus, Distinguish, and Prompt: Unleashing CLIP for Efficient and Flexible Scene Text Retrieval

Zeng, Gangyan, Zhang, Yuan, Wei, Jin, Yang, Dongbao, Zhang, Peng, Gao, Yiwen, Qin, Xugong, Zhou, Yu

arXiv.org Artificial IntelligenceAug-1-2024

Scene text retrieval aims to find all images containing the query text from an image gallery. Current efforts tend to adopt an Optical Character Recognition (OCR) pipeline, which requires complicated text detection and/or recognition processes, resulting in inefficient and inflexible retrieval. Different from them, in this work we propose to explore the intrinsic potential of Contrastive Language-Image Pre-training (CLIP) for OCR-free scene text retrieval. Through empirical analysis, we observe that the main challenges of CLIP as a text retriever are: 1) limited text perceptual scale, and 2) entangled visual-semantic concepts. To this end, a novel model termed FDP (Focus, Distinguish, and Prompt) is developed. FDP first focuses on scene text via shifting the attention to the text area and probing the hidden text knowledge, and then divides the query text into content word and function word for processing, in which a semantic-aware prompting scheme and a distracted queries assistance module are utilized. Extensive experiments show that FDP significantly enhances the inference speed while achieving better or competitive retrieval accuracy compared to existing methods. Notably, on the IIIT-STR benchmark, FDP surpasses the state-of-the-art model by 4.37% with a 4 times faster speed. Furthermore, additional experiments under phrase-level and attribute-aware scene text retrieval settings validate FDP's particular advantages in handling diverse forms of query text. The source code will be publicly available at https://github.com/Gyann-z/FDP.

query text, retrieval, scene text, (13 more...)

arXiv.org Artificial Intelligence

2408.00441

Country:

Oceania > Australia > Victoria > Melbourne (0.15)
Asia > China > Jiangsu Province > Nanjing (0.04)
North America > United States > New York > New York County > New York City (0.04)
(3 more...)

Genre: Research Report > Promising Solution (0.54)

Industry: Consumer Products & Services > Food, Beverage, Tobacco & Cannabis > Beverages (0.46)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.67)
Information Technology > Artificial Intelligence > Vision > Optical Character Recognition (0.54)
Information Technology > Artificial Intelligence > Machine Learning > Pattern Recognition (0.49)
Information Technology > Artificial Intelligence > Natural Language > Text Processing (0.48)

Add feedback

Qalam : A Multimodal LLM for Arabic Optical Character and Handwriting Recognition

Bhatia, Gagan, Nagoudi, El Moatez Billah, Alwajih, Fakhraddin, Abdul-Mageed, Muhammad

arXiv.org Artificial IntelligenceJul-18-2024

Arabic Optical Character Recognition (OCR) and Handwriting Recognition (HWR) pose unique challenges due to the cursive and context-sensitive nature of the Arabic script. This study introduces Qalam, a novel foundation model designed for Arabic OCR and HWR, built on a SwinV2 encoder and RoBERTa decoder architecture. Our model significantly outperforms existing methods, achieving a Word Error Rate (WER) of just 0.80% in HWR tasks and 1.18% in OCR tasks. We train Qalam on a diverse dataset, including over 4.5 million images from Arabic manuscripts and a synthetic dataset comprising 60k image-text pairs. Notably, Qalam demonstrates exceptional handling of Arabic diacritics, a critical feature in Arabic scripts. Furthermore, it shows a remarkable ability to process high-resolution inputs, addressing a common limitation in current OCR systems. These advancements underscore Qalam's potential as a leading solution for Arabic script recognition, offering a significant leap in accuracy and efficiency.

dataset, qalam, recognition, (12 more...)

arXiv.org Artificial Intelligence

2407.13559

Country:

North America > United States (0.05)
South America > Chile > Santiago Metropolitan Region > Santiago Province > Santiago (0.04)
North America > Canada > British Columbia (0.04)
(4 more...)

Genre: Research Report (1.00)

Technology:

Information Technology > Artificial Intelligence > Vision > Optical Character Recognition (1.00)
Information Technology > Artificial Intelligence > Vision > Handwriting Recognition (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
(2 more...)

Add feedback

TTSDS -- Text-to-Speech Distribution Score

Minixhofer, Christoph, Klejch, Ondřej, Bell, Peter

arXiv.org Artificial IntelligenceJul-17-2024

Many recently published Text-to-Speech (TTS) systems produce audio close to real speech. However, TTS evaluation needs to be revisited to make sense of the results obtained with the new architectures, approaches and datasets. We propose evaluating the quality of synthetic speech as a combination of multiple factors such as prosody, speaker identity, and intelligibility. Our approach assesses how well synthetic speech mirrors real speech by obtaining correlates of each factor and measuring their distance from both real speech datasets and noise datasets. We benchmark 35 TTS systems developed between 2008 and 2024 and show that our score computed as an unweighted average of factors strongly correlates with the human evaluations from each time period.

correlation, evaluation, speech, (16 more...)

arXiv.org Artificial Intelligence

2407.12707

Country:

Asia > Japan > Honshū > Tōhoku > Iwate Prefecture > Morioka (0.04)
Asia > Japan > Honshū > Kantō > Tokyo Metropolis Prefecture > Tokyo (0.04)

Genre: Research Report > New Finding (0.46)

Technology:

Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.94)
Information Technology > Artificial Intelligence > Speech > Speech Synthesis (0.88)
(2 more...)

Add feedback

Towards Zero-Shot Text-To-Speech for Arabic Dialects

Doan, Khai Duy, Waheed, Abdul, Abdul-Mageed, Muhammad

arXiv.org Artificial IntelligenceJul-7-2024

Zero-shot multi-speaker text-to-speech (ZS-TTS) systems have advanced for English, however, it still lags behind due to insufficient resources. We address this gap for Arabic, a language of more than 450 million native speakers, by first adapting a sizeable existing dataset to suit the needs of speech synthesis. Additionally, we employ a set of Arabic dialect identification models to explore the impact of pre-defined dialect labels on improving the ZS-TTS model in a multi-dialect setting. Subsequently, we fine-tune the XTTS\footnote{https://docs.coqui.ai/en/latest/models/xtts.html}\footnote{https://medium.com/machine-learns/xtts-v2-new-version-of-the-open-source-text-to-speech-model-af73914db81f}\footnote{https://medium.com/@erogol/xtts-v1-techincal-notes-eb83ff05bdc} model, an open-source architecture. We then evaluate our models on a dataset comprising 31 unseen speakers and an in-house dialectal dataset. Our automated and human evaluation results show convincing performance while capable of generating dialectal speech. Our study highlights significant potential for improvements in this emerging area of research in Arabic.

dataset, preprint, speech, (13 more...)

arXiv.org Artificial Intelligence

2406.16751

Country:

Asia > Middle East > Yemen (0.04)
Asia > Middle East > Palestine (0.04)
Asia > Middle East > Jordan (0.04)
(12 more...)

Genre: Research Report > New Finding (0.48)

Technology:

Information Technology > Artificial Intelligence > Speech > Speech Synthesis (1.00)
Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Vision > Optical Character Recognition (0.83)

Add feedback

Historical Ink: 19th Century Latin American Spanish Newspaper Corpus with LLM OCR Correction

Manrique-Gómez, Laura, Montes, Tony, Manrique, Rubén

arXiv.org Artificial IntelligenceJul-3-2024

Another substantial as key historical resources, contain a diverse project is the "Digging into Data Challenge". A range of information about political, economic, part of the Transatlantic Partnership for Social Sciences and cultural processes and are abundant due to and Humanities 2016, this initiative yielded focused efforts to preserve them within national a vast collection of 19th-century press materials archives. Indeed, the discipline of Digital Humanities, known as "Atlas - Oceanic Exchanges. Tracing which emphasizes the incorporation of digital Global Information Networks in Historical Papers" tools in humanities and social sciences research, (Exchanges). Other significant works include "Viral has spent much of the past three decades on the Texts: Mapping Networks of Reprinting in 19th-task of digitization, resulting in a wealth of curated Century Newspapers and Magazines" (Cordell and digital collections (Berry and Fagerjord, 2017; Dobson, Smith), a project that investigates 19th-century journalistic 2019). However, digitizing these corpora has reports to understand the culture of reprinting brought plenty of challenges in transcribing the in the United States before the Civil War, and images into machine-readable texts.

correction, dataset, ocr error, (10 more...)

arXiv.org Artificial Intelligence

2407.12838

Country:

North America > Panama (0.05)
South America > Venezuela (0.05)
South America > Colombia > Bogotá D.C. > Bogotá (0.05)
(7 more...)

Genre: Research Report (0.40)

Industry: Media > News (0.68)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.69)
Information Technology > Artificial Intelligence > Vision > Optical Character Recognition (0.47)

Add feedback