AITopics | Optical Character Recognition

Optical Character Recognition

Our second example deals with a more challenging problem: the recognition of hand-printed letters of the alphabet. The characters that people print in the ordinary course of filling out forms and questionnaires are surprisingly varied. Gaps abound where continuous lines might be expected; curves and sharp angles appear interchangeably; there is almost every imaginable distortion of slant, shape and size. Even human readers cannot always identify such characters; their error rate is about 3 per cent on randomly selected letters and numbers, seen out of context.
– from Oliver G. Selfridge & Ulric Neisser. PATTERN RECOGNITION BY MACHINE. In Computers & Thought, Edward A. Feigenbaum and Julian Feldman (Eds.). MIT Press, Cambridge, MA, USA, 1963. pp. 8-30.

Improving Inference Performance of Machine Learning with the Divide-and-Conquer Principle

Kogan, Alex

arXiv.org Artificial Intelligence, Mar-1-2023

Many popular machine learning models scale poorly when deployed on CPUs. In this paper we explore the reasons why and propose a simple, yet effective approach based on the well-known Divide-and-Conquer Principle to tackle this problem of great practical importance. Given an inference job, instead of using all available computing resources (i.e., CPU cores) for running it, the idea is to break the job into independent parts that can be executed in parallel, each with the number of cores according to its expected computational cost. We implement this idea in the popular OnnxRuntime framework and evaluate its effectiveness with several use cases, including the well-known models for optical character recognition (PaddleOCR) and natural language processing (BERT).
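The allocation idea in this abstract, giving each independent part a number of cores in line with its expected cost, can be sketched as follows. This is a minimal illustration, not the paper's policy and not an OnnxRuntime API: the function name `allocate_cores`, the proportional rule, and the largest-remainder rounding are all assumptions made for the example.

```python
def allocate_cores(costs, total_cores):
    """Split total_cores across parts in proportion to their expected cost.

    Hypothetical helper: every part gets one core, and the remaining cores are
    shared proportionally to cost using largest-remainder rounding.
    """
    if not costs:
        return []
    if total_cores < len(costs):
        raise ValueError("need at least one core per part")
    if any(c <= 0 for c in costs):
        raise ValueError("costs must be positive")

    total_cost = sum(costs)
    spare = total_cores - len(costs)  # cores left after one per part
    ideal = [spare * c / total_cost for c in costs]
    extra = [int(x) for x in ideal]  # whole cores each part certainly gets
    leftover = spare - sum(extra)

    # Hand the remaining cores to the parts with the largest fractional share.
    order = sorted(range(len(costs)), key=lambda i: ideal[i] - extra[i], reverse=True)
    for i in order[:leftover]:
        extra[i] += 1
    return [1 + e for e in extra]


if __name__ == "__main__":
    # Three parts, the last one expected to cost twice as much as the others.
    print(allocate_cores([1, 1, 2], total_cores=8))
```

Each part would then run in its own session or worker limited to its share of cores; how that limit is enforced depends on the runtime and is left out here.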

machine learning, natural language, pattern recognition, (23 more...)

arXiv.org Artificial Intelligence

2301.05099

Country: Asia > Vietnam > Long An Province (0.04)

Genre: Research Report (0.64)

Technology:

Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.93)
Information Technology > Artificial Intelligence > Vision > Optical Character Recognition (0.68)
(2 more...)

ClArTTS: An Open-Source Classical Arabic Text-to-Speech Corpus

Kulkarni, Ajinkya, Kulkarni, Atharva, Shatnawi, Sara Abedalmonem Mohammad, Aldarmaki, Hanan

arXiv.org Artificial Intelligence, Feb-28-2023

At present, text-to-speech (TTS) systems that are trained with high-quality transcribed speech data using end-to-end neural models can generate speech that is intelligible, natural, and closely resembles human speech. These models are trained with relatively large single-speaker professionally recorded audio, typically extracted from audiobooks. Meanwhile, due to the scarcity of freely available speech corpora of this kind, a larger gap exists in Arabic TTS research and development. Most of the existing freely available Arabic speech corpora are not suitable for TTS training as they contain multi-speaker casual speech with variations in recording conditions and quality, whereas the corpora curated for speech synthesis are generally small in size and not suitable for training state-of-the-art end-to-end models. In a move towards filling this gap in resources, we present a speech corpus for Classical Arabic Text-to-Speech (ClArTTS) to support the development of end-to-end TTS systems for Arabic. The speech is extracted from a LibriVox audiobook, which is then processed, segmented, and manually transcribed and annotated. The final ClArTTS corpus contains about 12 hours of speech from a single male speaker sampled at 40100 Hz. In this paper, we describe the process of corpus creation and provide details of corpus statistics and a comparison with existing resources. Furthermore, we develop two TTS systems based on Grad-TTS and Glow-TTS and illustrate the performance of the resulting systems via subjective and objective evaluations. The corpus will be made publicly available at www.clartts.com for research purposes, along with the baseline TTS systems demo.

artificial intelligence, corpus, optical character recognition, (15 more...)

arXiv.org Artificial Intelligence

2303.00069

Country:

Asia > Middle East > UAE (0.04)
Asia > India (0.04)

Genre: Research Report (0.82)

Industry: Media > Publishing (0.57)

Technology:

Information Technology > Artificial Intelligence > Speech > Speech Synthesis (1.00)
Information Technology > Artificial Intelligence > Vision > Optical Character Recognition (0.83)

Imaginary Voice: Face-styled Diffusion Model for Text-to-Speech

Lee, Jiyoung, Chung, Joon Son, Chung, Soo-Whan

arXiv.org Artificial Intelligence, Feb-27-2023

The goal of this work is zero-shot text-to-speech synthesis, with speaking styles and voices learnt from facial characteristics. Inspired by the natural fact that people can imagine the voice of someone when they look at his or her face, we introduce a face-styled diffusion text-to-speech (TTS) model within a unified framework learnt from visible attributes, called Face-TTS. This is the first time that face images are used as a condition to train a TTS model. We jointly train cross-model biometrics and TTS models to preserve speaker identity between face images and generated speech segments. We also propose a speaker feature binding loss to enforce the similarity of the generated and the ground truth speech segments in speaker embedding space. Since the biometric information is extracted directly from the face image, our method does not require extra fine-tuning steps to generate speech from unseen and unheard speakers. We train and evaluate the model on the LRS3 dataset, an in-the-wild audio-visual corpus containing background noise and diverse speaking styles. The project page is https://facetts.github.io.

artificial intelligence, face image, machine learning, (17 more...)

arXiv.org Artificial Intelligence

2302.137

Country:

Asia > South Korea (0.05)
Europe > Italy > Calabria > Catanzaro Province > Catanzaro (0.04)

Genre: Research Report (0.50)

Industry: Information Technology > Security & Privacy (0.66)

Technology:

Information Technology > Artificial Intelligence > Speech > Speech Synthesis (1.00)
Information Technology > Artificial Intelligence > Vision > Face Recognition (0.95)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.94)
Information Technology > Artificial Intelligence > Vision > Optical Character Recognition (0.83)

User-Centric Evaluation of OCR Systems for Kwak'wala

Rijhwani, Shruti, Rosenblum, Daisy, King, Michayla, Anastasopoulos, Antonios, Neubig, Graham

arXiv.org Artificial Intelligence, Feb-26-2023

There has been recent interest in improving optical character recognition (OCR) for endangered languages, particularly because a large number of documents and books in these languages are not in machine-readable formats. The performance of OCR systems is typically evaluated using automatic metrics such as character and word error rates. While error rates are useful for the comparison of different models and systems, they do not measure whether and how the transcriptions produced from OCR tools are useful to downstream users. In this paper, we present a human-centric evaluation of OCR systems, focusing on the Kwak'wala language as a case study. With a user study, we show that utilizing OCR reduces the time spent in the manual transcription of culturally valuable documents -- a task that is often undertaken by endangered language community members and researchers -- by over 50%. Our results demonstrate the potential benefits that OCR tools can have on downstream language documentation and revitalization efforts.
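The automatic metrics mentioned here, character error rate (CER) and word error rate (WER), are conventionally computed as edit distance divided by reference length. The sketch below assumes that standard definition and plain whitespace tokenization; the function names are illustrative, and the paper's exact evaluation setup may differ.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (strings or lists of tokens)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def cer(reference, hypothesis):
    """Character error rate: character edits divided by reference length."""
    if not reference:
        raise ValueError("reference must be non-empty")
    return edit_distance(reference, hypothesis) / len(reference)


def wer(reference, hypothesis):
    """Word error rate over whitespace-separated tokens."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


if __name__ == "__main__":
    print(cer("kitten", "sitting"))
    print(wer("the cat sat", "the cat sit"))
```

Both rates can exceed 1.0 when the OCR output is much longer than the reference, which is one reason they are an incomplete proxy for usefulness to a human transcriber.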

artificial intelligence, machine learning, natural language, (21 more...)

arXiv.org Artificial Intelligence

2302.1341

Country:

North America > Canada > British Columbia > Metro Vancouver Regional District > Vancouver (0.14)
North America > United States > Pennsylvania > Allegheny County > Pittsburgh (0.04)
North America > United States > New York (0.04)
(9 more...)

Genre:

Research Report > New Finding (1.00)
Research Report > Experimental Study (1.00)

Industry: Education (0.46)

Technology:

Information Technology > Artificial Intelligence > Vision > Optical Character Recognition (1.00)
Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Machine Learning (1.00)

Samsung's Bixby now supports text-to-speech in English calls

Engadget, Feb-22-2023, 10:35:39 GMT

Last year, Samsung introduced a feature called "Text Call" for Bixby with One UI 5, which essentially transforms voice calls into written text and vice versa. It was initially available in Korean, but now the company has launched support for the feature in (US) English. The feature lets users answer calls by typing a message that Bixby will then read out loud to the caller. It can also transcribe what the caller says, making it a pretty useful tool for those hard of hearing or for anyone taking a call in a noisy environment. While Bixby has several voice options, Samsung is giving users the capability to personalize the voice it uses to answer calls.

artificial intelligence, optical character recognition, samsung, (7 more...)

Engadget

Industry: Semiconductors & Electronics (0.99)

Technology:

Information Technology > Artificial Intelligence > Vision > Optical Character Recognition (0.40)
Information Technology > Artificial Intelligence > Speech > Speech Synthesis (0.40)
Information Technology > Artificial Intelligence > Assistive Technologies (0.40)

Lightweight and High-Fidelity End-to-End Text-to-Speech with Multi-Band Generation and Inverse Short-Time Fourier Transform

Kawamura, Masaya, Shirahata, Yuma, Yamamoto, Ryuichi, Tachibana, Kentaro

arXiv.org Artificial Intelligence, Feb-21-2023

We propose a lightweight end-to-end text-to-speech model using multi-band generation and inverse short-time Fourier transform. Our model is based on VITS, a high-quality end-to-end text-to-speech model, but adopts two changes for more efficient inference: 1) the most computationally expensive component is partially replaced with a simple inverse short-time Fourier transform, and 2) multi-band generation, with fixed or trainable synthesis filters, is used to generate waveforms. Unlike conventional lightweight models, which employ optimization or knowledge distillation separately to train two cascaded components, our method enjoys the full benefits of end-to-end optimization. Experimental results show that our model synthesized speech as natural as that synthesized by VITS, while achieving a real-time factor of 0.066 on an Intel Core i7 CPU, 4.1 times faster than VITS. Moreover, a smaller version of the model significantly outperformed a lightweight baseline model with respect to both naturalness and inference speed. Code and audio samples are available from https://github.com/MasayaKawamura/MB-iSTFT-VITS.
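For readers unfamiliar with the metric, the real-time factor (RTF) is commonly defined as synthesis time divided by the duration of the audio produced, so values below 1 mean faster than real time. The short sketch below assumes that definition; the 10-second clip is a made-up example, not a figure from the paper.

```python
def real_time_factor(synthesis_seconds, audio_seconds):
    """RTF = time spent synthesizing / duration of the generated audio."""
    if audio_seconds <= 0:
        raise ValueError("audio duration must be positive")
    return synthesis_seconds / audio_seconds


def synthesis_time(audio_seconds, rtf):
    """Time needed to generate audio_seconds of speech at a given RTF."""
    return audio_seconds * rtf


if __name__ == "__main__":
    reported_rtf = 0.066  # value quoted in the abstract
    clip = 10.0           # hypothetical clip length in seconds
    print(f"{synthesis_time(clip, reported_rtf):.2f} s to synthesize {clip:.0f} s of audio")
```

At the reported RTF, a 10-second utterance would take roughly two thirds of a second to generate on the CPU described.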

artificial intelligence, data quality, machine learning, (18 more...)

arXiv.org Artificial Intelligence

2210.15975

Country: Asia > Japan > Honshū > Kantō > Tokyo Metropolis Prefecture > Tokyo (0.14)

Genre: Research Report > New Finding (1.00)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.95)
Information Technology > Artificial Intelligence > Speech > Speech Synthesis (0.94)
Information Technology > Data Science > Data Quality > Data Transformation (0.91)
(2 more...)

EmoDiff: Intensity Controllable Emotional Text-to-Speech with Soft-Label Guidance

Guo, Yiwei, Du, Chenpeng, Chen, Xie, Yu, Kai

arXiv.org Artificial Intelligence, Feb-16-2023

Although current neural text-to-speech (TTS) models are able to generate high-quality speech, intensity controllable emotional TTS is still a challenging task. Most existing methods need external optimizations for intensity calculation, leading to suboptimal results or degraded quality. In this paper, we propose EmoDiff, a diffusion-based TTS model where emotion intensity can be manipulated by a proposed soft-label guidance technique derived from classifier guidance. Specifically, instead of being guided with a one-hot vector for the specified emotion, EmoDiff is guided with a soft label where the value of the specified emotion and Neutral is set to $\alpha$ and $1-\alpha$ respectively. The $\alpha$ here represents the emotion intensity and can be chosen from 0 to 1. Our experiments show that EmoDiff can precisely control the emotion intensity while maintaining high voice quality. Moreover, diverse speech with specified emotion intensity can be generated by sampling in the reverse denoising process.
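The soft label described in this abstract can be written out directly: the specified emotion receives weight alpha, Neutral receives 1 - alpha, and every other class stays at zero. The sketch below only builds that vector; the emotion list and the function name are hypothetical, and the guidance of the diffusion sampler itself is not shown.

```python
def soft_label(emotions, target, alpha, neutral="Neutral"):
    """Build a soft label with alpha on the target emotion and 1 - alpha on Neutral.

    `emotions` is an assumed, illustrative class list; only the alpha / (1 - alpha)
    split between the target and Neutral comes from the abstract.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    if target not in emotions or neutral not in emotions:
        raise ValueError("target and neutral must both be known classes")
    if target == neutral:
        raise ValueError("target must be a non-neutral emotion")

    label = [0.0] * len(emotions)
    label[emotions.index(target)] = alpha
    label[emotions.index(neutral)] = 1.0 - alpha
    return label


if __name__ == "__main__":
    classes = ["Neutral", "Happy", "Sad"]  # hypothetical emotion set
    print(soft_label(classes, "Happy", alpha=0.25))
```

Setting alpha to 1 recovers the usual one-hot label for the target emotion, and alpha of 0 yields a purely neutral label.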

artificial intelligence, classifier, machine learning, (18 more...)

arXiv.org Artificial Intelligence

2211.09496

Country:

Asia > China > Shanghai > Shanghai (0.05)
Europe > Italy > Calabria > Catanzaro Province > Catanzaro (0.04)

Genre: Research Report (0.40)

Technology:

Information Technology > Artificial Intelligence > Machine Learning (1.00)
Information Technology > Artificial Intelligence > Speech > Speech Synthesis (0.88)
Information Technology > Artificial Intelligence > Vision > Optical Character Recognition (0.62)

Noisy Parallel Data Alignment

Xie, Ruoyu, Anastasopoulos, Antonios

arXiv.org Artificial Intelligence, Feb-10-2023

An ongoing challenge in current natural language processing is how its major advancements tend to disproportionately favor resource-rich languages, leaving a significant number of under-resourced languages behind. Due to the lack of resources required to train and evaluate models, most modern language technologies are either nonexistent or unreliable for processing endangered, local, and non-standardized languages. Optical character recognition (OCR) is often used to convert endangered language documents into machine-readable data. However, such OCR output is typically noisy, and most word alignment models are not built to work under such noisy conditions. In this work, we study the existing word-level alignment models under noisy settings and aim to make them more robust to noisy data. Our noise simulation and structural biasing method, tested on multiple language pairs, manages to reduce the alignment error rate on a state-of-the-art neural-based alignment model by up to 59.6%.
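The alignment error rate (AER) named in this abstract is usually defined over a set of predicted links A, gold "sure" links S, and gold "possible" links P (with S contained in P) as AER = 1 - (|A ∩ S| + |A ∩ P|) / (|A| + |S|). The sketch below assumes that standard definition; the toy links are invented for illustration and are not data from the paper.

```python
def alignment_error_rate(predicted, sure, possible):
    """AER = 1 - (|A & S| + |A & P|) / (|A| + |S|), with S treated as a subset of P."""
    a = set(predicted)
    s = set(sure)
    p = set(possible) | s  # sure links always count as possible
    if not a and not s:
        raise ValueError("need at least one predicted or sure link")
    return 1.0 - (len(a & s) + len(a & p)) / (len(a) + len(s))


if __name__ == "__main__":
    # Links are (source_index, target_index) pairs; all values are made up.
    predicted = {(0, 0), (1, 1), (2, 3)}
    sure = {(0, 0), (1, 1)}
    possible = {(2, 2)}
    print(round(alignment_error_rate(predicted, sure, possible), 3))
```

A lower AER is better, and a perfect aligner that predicts exactly the sure links scores 0.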

artificial intelligence, machine learning, natural language, (19 more...)

arXiv.org Artificial Intelligence

2301.09685

Country:

Africa > Middle East > Egypt > Giza Governorate > Giza (0.05)
South America > Peru (0.04)
North America > United States > Pennsylvania > Allegheny County > Pittsburgh (0.04)
(11 more...)

Genre: Research Report (0.82)

Technology:

Information Technology > Artificial Intelligence > Vision > Optical Character Recognition (1.00)
Information Technology > Artificial Intelligence > Natural Language > Machine Translation (1.00)
Information Technology > Artificial Intelligence > Machine Learning (1.00)

9 Ways We Use AI In Our Products - Liwaiwai

#artificialintelligence, Feb-4-2023, 07:20:09 GMT

The past few years have seen huge breakthroughs in the use and application of artificial intelligence -- and AI holds major promise for people around the world. AI already powers Google's core products that help billions of people every day. Whether it's asking for movie times, finding the nearest doctor or finding better routes home -- our work in AI is centered on making everyday experiences more helpful. We've been developing AI for more than two decades. Some of our most popular products at Google -- like Lens and Translate -- were built entirely using artificial intelligence technologies like optical character recognition and machine learning.

language model, liwaiwai, use ai, (1 more...)

#artificialintelligence

Industry: Information Technology > Services (0.40)

Technology:

Information Technology > Artificial Intelligence > Machine Learning (0.77)
Information Technology > Artificial Intelligence > Vision > Optical Character Recognition (0.60)
Information Technology > Artificial Intelligence > Applied AI (0.60)

The ultimate guide to building ANPR systems using computer vision

#artificialintelligence, Feb-3-2023, 18:55:35 GMT

Extraordinary technological advances have enabled the development of numerous helpful tools and techniques to alleviate human effort. Automatic Number Plate Recognition (ANPR), one such technology, is quickly gaining global prevalence and offers an abundance of advantages. It recognizes license plates and can be used for traffic enforcement, parking management, and many other activities depending on user demands. ANPR systems are highly reliable and built with cutting-edge technologies like artificial intelligence (AI), enabling them to be precise and functional. Thus, this blog post will discuss some key aspects of how the ANPR system works to provide you with a clear understanding of the mechanics of the ANPR system.

anpr system, opération, recognition, (12 more...)

#artificialintelligence

Industry: Law Enforcement & Public Safety > Crime Prevention & Enforcement (0.36)

Technology: Information Technology > Artificial Intelligence > Vision > Optical Character Recognition (0.90)
