AITopics | Optical Character Recognition

Optical Character Recognition

Our second example deals with a more challenging problem: the recognition of hand-printed letters of the alphabet. The characters that people print in the ordinary course of filling out forms and questionnaires are surprisingly varied. Gaps abound where continuous lines might be expected; curves and sharp angles appear interchangeably; there is almost every imaginable distortion of slant, shape and size. Even human readers cannot always identify such characters; their error rate is about 3 per cent on randomly selected letters and numbers, seen out of context.
– from Oliver G. Selfridge & Ulric Neisser. Pattern Recognition by Machine. In Computers & Thought, Edward A. Feigenbaum and Julian Feldman (Eds.). MIT Press, Cambridge, MA, USA, 1963. pp. 8-30.

Ubisoft accidentally used text-to-speech to voice a character in the new Prince of Persia game

Engadget, Jan-11-2024, 19:34:48 GMT

Ubisoft's Prince of Persia: The Lost Crown launches next week, but players are likely to encounter an amusing bug as they make their way through the game, as reported by IGN. One of the game's NPCs is voiced by a text-to-speech program, complete with the slightly robotic tones we've come to associate with these services. It's not quite Siri or Alexa, but it's close and certainly doesn't fit the game's Persian-inspired setting. The NPC-in-question is a tree spirit named Kalux and seems to be voiced by a TTS program that's available online for free and typically used by streamers. This isn't an "AI is coming for your jobs" type thing, but rather a mistake on Ubisoft's part, as each and every other NPC is attached to a voice actor.

artificial intelligence, optical character recognition, prince, (9 more...)

Engadget

Technology:

Information Technology > Artificial Intelligence > Vision > Optical Character Recognition (0.62)
Information Technology > Artificial Intelligence > Speech > Speech Synthesis (0.62)
Information Technology > Artificial Intelligence > Assistive Technologies (0.62)

Evaluating and Personalizing User-Perceived Quality of Text-to-Speech Voices for Delivering Mindfulness Meditation with Different Physical Embodiments

Shi, Zhonghao, Chen, Han, Velentza, Anna-Maria, Liu, Siqi, Dennler, Nathaniel, O'Connell, Allison, Matarić, Maja

arXiv.org Artificial Intelligence, Jan-7-2024

Mindfulness-based therapies have been shown to be effective in improving mental health, and technology-based methods have the potential to expand the accessibility of these therapies. To enable real-time personalized content generation for mindfulness practice in these methods, high-quality computer-synthesized text-to-speech (TTS) voices are needed to provide verbal guidance and respond to user performance and preferences. However, the user-perceived quality of state-of-the-art TTS voices has not yet been evaluated for administering mindfulness meditation, which requires emotional expressiveness. In addition, work has not yet been done to study the effect of physical embodiment and personalization on the user-perceived quality of TTS voices for mindfulness. To that end, we designed a two-phase human subject study. In Phase 1, an online Mechanical Turk between-subject study (N=471) evaluated 3 (feminine, masculine, child-like) state-of-the-art TTS voices with 2 (feminine, masculine) human therapists' voices in 3 different physical embodiment settings (no agent, conversational agent, socially assistive robot) with remote participants. Building on findings from Phase 1, in Phase 2, an in-person within-subject study (N=94), we used a novel framework we developed for personalizing TTS voices based on user preferences, and evaluated user-perceived quality compared to best-rated non-personalized voices from Phase 1. We found that the best-rated human voice was perceived better than all TTS voices; the emotional expressiveness and naturalness of TTS voices were poorly rated, while users were satisfied with the clarity of TTS voices. Surprisingly, by allowing users to fine-tune TTS voice features, the user-personalized TTS voices could perform almost as well as human voices, suggesting user personalization could be a simple and very effective tool to improve user-perceived quality of TTS voice.

embodiment, participant, tts voice, (15 more...)

arXiv.org Artificial Intelligence

doi: 10.1145/3568162.3576987

2401.03581

Country:

North America > United States > California > Los Angeles County > Los Angeles (0.29)
Europe > Sweden > Stockholm > Stockholm (0.05)
North America > United States > New York > New York County > New York City (0.04)
(2 more...)

Genre:

Research Report > New Finding (1.00)
Research Report > Experimental Study (1.00)
Questionnaire & Opinion Survey (1.00)
Research Report > Strength High (0.68)

Industry:

Health & Medicine > Consumer Health (1.00)
Health & Medicine > Therapeutic Area > Psychiatry/Psychology > Mental Health (0.86)

Technology:

Information Technology > Artificial Intelligence > Robots (1.00)
Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Machine Learning (1.00)
(3 more...)

Survey on Publicly Available Sinhala Natural Language Processing Tools and Research

de Silva, Nisansa

arXiv.org Artificial Intelligence, Jan-4-2024

Sinhala is the native language of the Sinhalese people who make up the largest ethnic group of Sri Lanka. The language belongs to the globe-spanning language tree, Indo-European. However, due to poverty in both linguistic and economic capital, Sinhala, in the perspective of Natural Language Processing tools and research, remains a resource-poor language which has neither the economic drive its cousin English has nor the sheer push of the law of numbers a language such as Chinese has. A number of research groups from Sri Lanka have noticed this dearth and the resultant dire need for proper tools and research for Sinhala natural language processing. However, due to various reasons, these attempts seem to lack coordination and awareness of each other. The objective of this paper is to fill that gap of a comprehensive literature survey of the publicly available Sinhala natural language tools and research so that the researchers working in this field can better utilize contributions of their peers. As such, we shall be uploading this paper to arXiv and perpetually update it periodically to reflect the advances made in the field.

ieee, international conference, sinhala, (14 more...)

arXiv.org Artificial Intelligence

1906.02358

Country:

Europe > Switzerland > Zürich > Zürich (0.14)
Europe > Finland > Uusimaa > Helsinki (0.04)
North America > United States > New York (0.04)
(13 more...)

Genre:

Overview (1.00)
Research Report > New Finding (0.67)

Industry:

Media > News (1.00)
Information Technology > Services (1.00)
Education (1.00)
(2 more...)

Technology:

Information Technology > Artificial Intelligence > Speech > Speech Recognition (1.00)
Information Technology > Artificial Intelligence > Natural Language > Text Processing (1.00)
Information Technology > Artificial Intelligence > Natural Language > Machine Translation (1.00)
(12 more...)

Incremental FastPitch: Chunk-based High Quality Text to Speech

Du, Muyang, Liu, Chuan, Lai, Junjie

arXiv.org Artificial Intelligence, Jan-3-2024

Parallel text-to-speech models have been widely applied for real-time speech synthesis, and they offer more controllability and a much faster synthesis process compared with conventional auto-regressive models. Although parallel models have benefits in many aspects, they become naturally unfit for incremental synthesis due to their fully parallel architecture such as transformer. In this work, we propose Incremental FastPitch, a novel FastPitch variant capable of incrementally producing high-quality Mel chunks by improving the architecture with chunk-based FFT blocks, training with receptive-field constrained chunk attention masks, and inference with fixed size past model states. Experimental results show that our proposal can produce speech quality comparable to the parallel FastPitch, with significantly lower latency that allows even lower response time for real-time speech applications. Index Terms: text-to-speech, speech synthesis, realtime, low-latency, streaming tts. 1. Introduction: In recent years, Text-to-Speech (TTS) technology has witnessed remarkable advancements, enabling the generation of natural and expressive speech from text inputs. (Figure 1: Incremental FastPitch, Chunk-based FFT Block, and Chunk Mask for Receptive-Field Constrained Training.)
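The chunk attention mask named in the abstract above can be pictured with a small sketch. This is an illustration only, not the authors' implementation: it assumes each frame may attend to every frame in its own chunk plus a fixed number of earlier chunks, and the function name and the `chunk_size` and `past_chunks` parameters are hypothetical.

```python
def chunk_attention_mask(num_frames, chunk_size, past_chunks):
    """Build a boolean attention mask; True means the query may attend to the key.

    Assumption (not taken from the paper): a query frame sees its own chunk
    and at most `past_chunks` preceding chunks, and never any future chunk.
    """
    if chunk_size <= 0 or past_chunks < 0:
        raise ValueError("chunk_size must be positive and past_chunks non-negative")
    mask = []
    for query in range(num_frames):
        query_chunk = query // chunk_size
        row = []
        for key in range(num_frames):
            key_chunk = key // chunk_size
            # Allowed when the key's chunk lies inside the limited receptive field.
            row.append(query_chunk - past_chunks <= key_chunk <= query_chunk)
        mask.append(row)
    return mask


if __name__ == "__main__":
    # Six frames in chunks of two, each chunk seeing one past chunk.
    for row in chunk_attention_mask(num_frames=6, chunk_size=2, past_chunks=1):
        print("".join("x" if allowed else "." for allowed in row))
```

Because no frame ever attends to a later chunk, a model trained under such a mask could in principle emit one chunk at a time while caching a bounded amount of past state, which is the property the abstract describes for incremental synthesis.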

fastpitch, incremental fastpitch, inference, (16 more...)

arXiv.org Artificial Intelligence

2401.01755

Country:

North America > United States (0.14)
Europe > Italy > Calabria > Catanzaro Province > Catanzaro (0.04)
Asia (0.04)

Genre: Research Report > New Finding (0.34)

Technology:

Information Technology > Artificial Intelligence > Vision > Optical Character Recognition (1.00)
Information Technology > Artificial Intelligence > Speech > Speech Synthesis (1.00)

Utilizing Neural Transducers for Two-Stage Text-to-Speech via Semantic Token Prediction

Kim, Minchan, Jeong, Myeonghun, Choi, Byoung Jin, Kim, Semin, Lee, Joun Yeop, Kim, Nam Soo

arXiv.org Artificial Intelligence, Jan-2-2024

We propose a novel text-to-speech (TTS) framework centered around a neural transducer. Our approach divides the whole TTS pipeline into semantic-level sequence-to-sequence (seq2seq) modeling and fine-grained acoustic modeling stages, utilizing discrete semantic tokens obtained from wav2vec2.0 embeddings. For a robust and efficient alignment modeling, we employ a neural transducer named token transducer for the semantic token prediction, benefiting from its hard monotonic alignment constraints. Subsequently, a non-autoregressive (NAR) speech generator efficiently synthesizes waveforms from these semantic tokens. Additionally, a reference speech controls temporal dynamics and acoustic conditions at each stage. This decoupled framework reduces the training complexity of TTS while allowing each stage to focus on semantic and acoustic modeling. Our experimental results on zero-shot adaptive TTS demonstrate that our model surpasses the baseline in terms of speech quality and speaker similarity, both objectively and subjectively. We also delve into the inference speed and prosody control capabilities of our approach, highlighting the potential of neural transducers in TTS frameworks.

neural transducer, speech, transducer, (15 more...)

arXiv.org Artificial Intelligence

2401.01498

Country:

Asia > South Korea > Seoul > Seoul (0.04)
Oceania > Australia > Victoria > Melbourne (0.04)
North America > United States > California > San Diego County > San Diego (0.04)
(3 more...)

Genre: Research Report (0.64)

Technology:

Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Speech > Speech Recognition (0.94)
Information Technology > Artificial Intelligence > Speech > Acoustic Processing (0.87)
(3 more...)

PromptTTS++: Controlling Speaker Identity in Prompt-Based Text-to-Speech Using Natural Language Descriptions

Shimizu, Reo, Yamamoto, Ryuichi, Kawamura, Masaya, Shirahata, Yuma, Doi, Hironori, Komatsu, Tatsuya, Tachibana, Kentaro

arXiv.org Artificial Intelligence, Dec-27-2023

We propose PromptTTS++, a prompt-based text-to-speech (TTS) synthesis system that allows control over speaker identity using natural language descriptions. To control speaker identity within the prompt-based TTS framework, we introduce the concept of speaker prompt, which describes voice characteristics (e.g., gender-neutral, young, old, and muffled) designed to be approximately independent of speaking style. Since there is no large-scale dataset containing speaker prompts, we first construct a dataset based on the LibriTTS-R corpus with manually annotated speaker prompts. We then employ a diffusion-based acoustic model with mixture density networks to model diverse speaker factors in the training data. Unlike previous studies that rely on style prompts describing only a limited aspect of speaker individuality, such as pitch, speaking speed, and energy, our method utilizes an additional speaker prompt to effectively learn the mapping from natural language descriptions to the acoustic features of diverse speakers. Our subjective evaluation results show that the proposed method can better control speaker characteristics than the methods without the speaker prompt. Audio samples are available at https://reppy4620.github.io/demo.promptttspp/.

encoder, speaker prompt, style prompt, (14 more...)

arXiv.org Artificial Intelligence

2309.0814

Country:

North America > Canada > Quebec > Montreal (0.04)
Europe > United Kingdom > England > West Midlands > Birmingham (0.04)
Asia > Japan > Honshū > Tōhoku (0.04)

Genre: Research Report > New Finding (0.66)

Technology:

Information Technology > Artificial Intelligence > Speech > Speech Synthesis (0.74)
Information Technology > Artificial Intelligence > Vision > Optical Character Recognition (0.63)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.47)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.47)

Advancements and Challenges in Arabic Optical Character Recognition: A Comprehensive Survey

Kasem, Mahmoud SalahEldin, Mahmoud, Mohamed, Kang, Hyun-Soo

arXiv.org Artificial Intelligence, Dec-18-2023

Optical character recognition (OCR) is a vital process that involves the extraction of handwritten or printed text from scanned or printed images, converting it into a format that can be understood and processed by machines. This enables further data processing activities such as searching and editing. The automatic extraction of text through OCR plays a crucial role in digitizing documents, enhancing productivity, improving accessibility, and preserving historical records. This paper seeks to offer an exhaustive review of contemporary applications, methodologies, and challenges associated with Arabic Optical Character Recognition (OCR). A thorough analysis is conducted on prevailing techniques utilized throughout the OCR process, with a dedicated effort to discern the most efficacious approaches that demonstrate enhanced outcomes. To ensure a thorough evaluation, a meticulous keyword-search methodology is adopted, encompassing a comprehensive analysis of articles relevant to Arabic OCR, including both backward and forward citation reviews. In addition to presenting cutting-edge techniques and methods, this paper critically identifies research gaps within the realm of Arabic OCR. By highlighting these gaps, we shed light on potential areas for future exploration and development, thereby guiding researchers toward promising avenues in the field of Arabic OCR. The outcomes of this study provide valuable insights for researchers, practitioners, and stakeholders involved in Arabic OCR, ultimately fostering advancements in the field and facilitating the creation of more accurate and efficient OCR systems for the Arabic language.

accuracy, dataset, recognition, (13 more...)

arXiv.org Artificial Intelligence

2312.11812

Country:

Africa > Middle East > Egypt (0.04)
North America > United States > Nevada > Clark County > Las Vegas (0.04)
Europe > Switzerland > Fribourg > Fribourg (0.04)
(4 more...)

Genre:

Research Report > Promising Solution (1.00)
Overview (1.00)

Industry:

Information Technology > Security & Privacy (1.00)
Health & Medicine (1.00)

Technology:

Information Technology > Artificial Intelligence > Vision > Optical Character Recognition (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Performance Analysis > Accuracy (0.69)

A review-based study on different Text-to-Speech technologies

Chowdhury, Md. Jalal Uddin, Hussan, Ashab

arXiv.org Artificial Intelligence, Dec-17-2023

This research paper presents a comprehensive review-based study on various Text-to-Speech (TTS) technologies. TTS technology is an important aspect of human-computer interaction, enabling machines to convert written text into audible speech. The paper examines the different TTS technologies available, including concatenative TTS, formant synthesis TTS, and statistical parametric TTS. The study focuses on comparing the advantages and limitations of these technologies in terms of their naturalness of voice, the level of complexity of the system, and their suitability for different applications. In addition, the paper explores the latest advancements in TTS technology, including neural TTS and hybrid TTS. The findings of this research will provide valuable insights for researchers, developers, and users who want to understand the different TTS technologies and their suitability for specific applications.

international conference, review-based study, speech, (16 more...)

arXiv.org Artificial Intelligence

2312.11563

Country: Asia > Pakistan > Punjab > Lahore Division > Lahore (0.04)

Genre: Research Report > New Finding (0.49)

Technology:

Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Speech > Speech Synthesis (0.97)
Information Technology > Artificial Intelligence > Machine Learning (0.95)
Information Technology > Artificial Intelligence > Vision > Optical Character Recognition (0.70)

Schrodinger Bridges Beat Diffusion Models on Text-to-Speech Synthesis

Chen, Zehua, He, Guande, Zheng, Kaiwen, Tan, Xu, Zhu, Jun

arXiv.org Artificial Intelligence, Dec-6-2023

In text-to-speech (TTS) synthesis, diffusion models have achieved promising generation quality. However, because of the pre-defined data-to-noise diffusion process, their prior distribution is restricted to a noisy representation, which provides little information of the generation target. In this work, we present a novel TTS system, Bridge-TTS, making the first attempt to substitute the noisy Gaussian prior in established diffusion-based TTS methods with a clean and deterministic one, which provides strong structural information of the target. Specifically, we leverage the latent representation obtained from text input as our prior, and build a fully tractable Schrodinger bridge between it and the ground-truth mel-spectrogram, leading to a data-to-data process. Moreover, the tractability and flexibility of our formulation allow us to empirically study the design spaces such as noise schedules, as well as to develop stochastic and deterministic samplers. Experimental results on the LJ-Speech dataset illustrate the effectiveness of our method in terms of both synthesis quality and sampling efficiency, significantly outperforming our diffusion counterpart Grad-TTS in 50-step/1000-step synthesis and strong fast TTS models in few-step scenarios. Project page: https://bridge-tts.github.io/

bridge-tts, diffusion model, international conference, (11 more...)

arXiv.org Artificial Intelligence

2312.03491

Country:

Europe > Italy > Calabria > Catanzaro Province > Catanzaro (0.04)
Asia (0.04)

Genre: Research Report (0.50)

Technology:

Information Technology > Artificial Intelligence > Speech > Speech Synthesis (1.00)
Information Technology > Artificial Intelligence > Vision > Optical Character Recognition (0.71)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.67)

Vulnerability Analysis of Transformer-based Optical Character Recognition to Adversarial Attacks

Beerens, Lucas, Higham, Desmond J.

arXiv.org Artificial Intelligence, Nov-28-2023

Recent advancements in Optical Character Recognition (OCR) have been driven by transformer-based models. OCR systems are critical in numerous high-stakes domains, yet their vulnerability to adversarial attack remains largely uncharted territory, raising concerns about security and compliance with emerging AI regulations. In this work we present a novel framework to assess the resilience of Transformer-based OCR (TrOCR) models. We develop and assess algorithms for both targeted and untargeted attacks. For the untargeted case, we measure the Character Error Rate (CER), while for the targeted case we use the success ratio. We find that TrOCR is highly vulnerable to untargeted attacks and somewhat less vulnerable to targeted attacks. On a benchmark handwriting data set, untargeted attacks can cause a CER of more than 1 without being noticeable to the eye. With a similar perturbation size, targeted attacks can lead to success rates of around 25%; here we attacked single tokens, requiring TrOCR to output the tenth most likely token from a large vocabulary.
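A CER above 1 may look surprising for an error rate. The sketch below shows how it can arise, assuming the common definition of CER as character-level edit distance divided by the length of the reference text; the abstract does not state the exact normalization the authors use, and the function names here are illustrative.

```python
def levenshtein(ref: str, hyp: str) -> int:
    """Minimum number of character insertions, deletions, and substitutions."""
    prev = list(range(len(hyp) + 1))  # distances from the empty reference prefix
    for i, r in enumerate(ref, start=1):
        curr = [i]  # turning i reference characters into an empty hypothesis
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # delete r from the reference
                            curr[j - 1] + 1,      # insert h into the reference
                            prev[j - 1] + cost))  # match or substitute
        prev = curr
    return prev[-1]


def character_error_rate(ref: str, hyp: str) -> float:
    """Edit distance normalized by reference length (assumed definition of CER)."""
    if not ref:
        raise ValueError("reference text must be non-empty")
    return levenshtein(ref, hyp) / len(ref)


if __name__ == "__main__":
    print(character_error_rate("abc", "abc"))  # identical strings
    print(character_error_rate("abc", "abd"))  # one substitution in three characters
    print(character_error_rate("ab", "xyzw"))  # hypothesis longer than the reference
```

Under this definition the rate is not capped at 1: when the recognized text is longer than the reference and shares little with it, the number of edits exceeds the reference length, as in the last call, where four edits against a two-character reference give a CER of 2.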

adversarial attack, perturbation, trocr, (12 more...)

arXiv.org Artificial Intelligence

2311.17128

Country:

Europe > United Kingdom (0.14)
North America > United States > Louisiana > Orleans Parish > New Orleans (0.04)
North America > United States > California > Santa Clara County > San Jose (0.04)
(3 more...)

Genre: Research Report > New Finding (0.46)

Industry:

Information Technology > Security & Privacy (1.00)
Government (1.00)

Technology:

Information Technology > Artificial Intelligence > Vision > Optical Character Recognition (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
