Optical Character Recognition


Lightweight and High-Fidelity End-to-End Text-to-Speech with Multi-Band Generation and Inverse Short-Time Fourier Transform

arXiv.org Artificial Intelligence

We propose a lightweight end-to-end text-to-speech model using multi-band generation and inverse short-time Fourier transform. Our model is based on VITS, a high-quality end-to-end text-to-speech model, but adopts two changes for more efficient inference: 1) the most computationally expensive component is partially replaced with a simple inverse short-time Fourier transform, and 2) multi-band generation, with fixed or trainable synthesis filters, is used to generate waveforms. Unlike conventional lightweight models, which employ optimization or knowledge distillation separately to train two cascaded components, our method enjoys the full benefits of end-to-end optimization. Experimental results show that our model synthesized speech as natural as that synthesized by VITS, while achieving a real-time factor of 0.066 on an Intel Core i7 CPU, 4.1 times faster than VITS. Moreover, a smaller version of the model significantly outperformed a lightweight baseline model with respect to both naturalness and inference speed. Code and audio samples are available from https://github.com/MasayaKawamura/MB-iSTFT-VITS.
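
The real-time factor quoted above is conventionally the ratio of synthesis time to the duration of the generated audio, so a value below 1 means faster than real time. A minimal sketch, assuming that conventional definition; the function name and the example timings are hypothetical and not taken from the paper:

```python
def real_time_factor(synthesis_seconds: float, audio_seconds: float) -> float:
    """Return synthesis time divided by the duration of the audio produced."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return synthesis_seconds / audio_seconds


# Hypothetical timings for illustration only:
# 0.66 s of compute to synthesize 10 s of audio.
rtf = real_time_factor(0.66, 10.0)
print(f"RTF = {rtf:.3f}")
```

Under this definition, a speedup factor between two models is simply the ratio of their real-time factors measured on the same hardware.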


EmoDiff: Intensity Controllable Emotional Text-to-Speech with Soft-Label Guidance

arXiv.org Artificial Intelligence

Although current neural text-to-speech (TTS) models are able to generate high-quality speech, intensity controllable emotional TTS is still a challenging task. Most existing methods need external optimizations for intensity calculation, leading to suboptimal results or degraded quality. In this paper, we propose EmoDiff, a diffusion-based TTS model where emotion intensity can be manipulated by a proposed soft-label guidance technique derived from classifier guidance. Specifically, instead of being guided with a one-hot vector for the specified emotion, EmoDiff is guided with a soft label where the value of the specified emotion and Neutral is set to $\alpha$ and $1-\alpha$ respectively. The $\alpha$ here represents the emotion intensity and can be chosen from 0 to 1. Our experiments show that EmoDiff can precisely control the emotion intensity while maintaining high voice quality. Moreover, diverse speech with specified emotion intensity can be generated by sampling in the reverse denoising process.
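
The soft label described above can be illustrated with a short sketch. It only builds the label vector and does not implement the diffusion model or the guidance step. The function name, the emotion list, and the choice to reject a Neutral target are assumptions made for illustration:

```python
def soft_label(emotions, target, alpha, neutral="Neutral"):
    """Build a soft label: alpha on the target emotion, 1 - alpha on Neutral.

    `emotions` is the ordered list of class names (assumed, not from the paper).
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    if target not in emotions or neutral not in emotions:
        raise ValueError("target and neutral must both be known emotions")
    if target == neutral:
        raise ValueError("target must differ from the neutral class")
    label = {name: 0.0 for name in emotions}
    label[target] = alpha            # emotion intensity
    label[neutral] = 1.0 - alpha     # remaining mass goes to Neutral
    return [label[name] for name in emotions]


# alpha = 1 recovers the usual one-hot label; alpha = 0 is fully Neutral.
print(soft_label(["Neutral", "Happy", "Sad"], "Happy", 0.25))
```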


Noisy Parallel Data Alignment

arXiv.org Artificial Intelligence

An ongoing challenge in natural language processing is that its major advancements disproportionately favor resource-rich languages, leaving a significant number of under-resourced languages behind. Due to the lack of resources required to train and evaluate models, most modern language technologies are either nonexistent or unreliable for processing endangered, local, and non-standardized languages. Optical character recognition (OCR) is often used to convert endangered language documents into machine-readable data. However, such OCR output is typically noisy, and most word alignment models are not built to work under such noisy conditions. In this work, we study the existing word-level alignment models under noisy settings and aim to make them more robust to noisy data. Our noise simulation and structural biasing method, tested on multiple language pairs, manages to reduce the alignment error rate on a state-of-the-art neural-based alignment model by up to 59.6%.
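
For readers unfamiliar with the metric, alignment error rate (AER) is commonly defined over predicted links A, sure gold links S, and possible gold links P as 1 - (|A ∩ S| + |A ∩ P|) / (|A| + |S|). The sketch below assumes that common definition; the abstract does not state which variant the authors use, and the function name and example links are hypothetical:

```python
def alignment_error_rate(predicted, sure, possible=None):
    """AER = 1 - (|A & S| + |A & P|) / (|A| + |S|), with S treated as a subset of P.

    Each link is a (source_index, target_index) pair.
    """
    predicted = set(predicted)
    sure = set(sure)
    # If no separate "possible" links are given, fall back to the sure links.
    possible = sure if possible is None else set(possible) | sure
    denominator = len(predicted) + len(sure)
    if denominator == 0:
        return 0.0  # nothing predicted and nothing expected
    matched = len(predicted & sure) + len(predicted & possible)
    return 1.0 - matched / denominator


# Hypothetical example: two of three predicted links agree with the gold links.
gold = {(0, 0), (1, 1), (2, 2)}
guess = {(0, 0), (1, 1), (2, 3)}
print(round(alignment_error_rate(guess, gold), 3))
```

Lower is better: a perfect alignment scores 0, and OCR noise that shifts or breaks tokens typically pushes the value up.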


9 Ways We Use AI In Our Products - Liwaiwai

#artificialintelligence

The past few years have seen huge breakthroughs in the use and application of artificial intelligence -- and AI holds major promise for people around the world. AI already powers Google's core products that help billions of people every day. Whether it's asking for movie times, finding the nearest doctor or finding better routes home -- our work in AI is centered on making everyday experiences more helpful. We've been developing AI for more than two decades. Some of our most popular products at Google -- like Lens and Translate -- were built entirely using artificial intelligence technologies like optical character recognition and machine learning.


The ultimate guide to building ANPR systems using computer vision

#artificialintelligence

Technological advances have enabled the development of numerous helpful tools and techniques that reduce human effort. Automatic Number Plate Recognition (ANPR), one such technology, is quickly gaining global prevalence and offers many advantages. It recognizes license plates and can be used for traffic enforcement, parking management, and many other activities depending on user demands. ANPR systems are highly reliable and built with cutting-edge technologies like artificial intelligence (AI), enabling them to be precise and functional. This blog post discusses key aspects of how an ANPR system works to give you a clear understanding of its mechanics.


Hanwha's life insurance wing adopts optical character recognition solution to handle multiple documents

#artificialintelligence

[Courtesy of Upstage] SEOUL -- Hanwha Life Insurance, one of the major life insurance companies in South Korea, adopted an optical character recognition solution capable of processing insurance claim documents such as medical expense receipts. The solution created by domestic artificial intelligenc


💣Notes to Self: Optical Character Recognition or Optical Character Reader (OCR)

#artificialintelligence

Optical character recognition (OCR) is a technology that allows computers to recognize and extract text from images, such as scanned documents, photographs, bills, etc. The process involves analyzing the image and identifying the individual characters within it and then converting those characters into machine-readable text. OCR software can be used to automate tasks such as document scanning, business automation, and accessibility technology. OCR software uses complex algorithms and pattern recognition techniques to identify and extract text. OCR technology has evolved over time and now it has the ability to recognize text in multiple languages and different fonts.


Phoneme-Level BERT for Enhanced Prosody of Text-to-Speech with Grapheme Predictions

arXiv.org Artificial Intelligence

Large-scale pre-trained language models have been shown to be helpful in improving the naturalness of text-to-speech (TTS) models by enabling them to produce more naturalistic prosodic patterns. However, these models are usually word-level or sup-phoneme-level and jointly trained with phonemes, making them inefficient for the downstream TTS task where only phonemes are needed. In this work, we propose a phoneme-level BERT (PL-BERT) with a pretext task of predicting the corresponding graphemes along with the regular masked phoneme predictions. Subjective evaluations show that our phoneme-level BERT encoder has significantly improved the mean opinion scores (MOS) of rated naturalness of synthesized speech compared with the state-of-the-art (SOTA) StyleTTS baseline on out-of-distribution (OOD) texts.


On the feasibility of attacking Thai LPR systems with adversarial examples

arXiv.org Artificial Intelligence

Recent advances in deep neural networks (DNNs) have significantly enhanced the capabilities of optical character recognition (OCR) technology, enabling its adoption in a wide range of real-world applications. Despite this success, DNN-based OCR has been shown to be vulnerable to adversarial attacks, in which the adversary can influence the DNN model's prediction by carefully manipulating the input to the model. Prior work has demonstrated the security impacts of adversarial attacks on OCR for various languages. However, to date, no studies have been conducted on an OCR system tailored specifically for the Thai language. To bridge this gap, this work presents a feasibility study of performing adversarial attacks on a specific Thai OCR application -- Thai License Plate Recognition (LPR). Moreover, we propose a new type of adversarial attack based on the semi-targeted scenario and show that this scenario is highly realistic in LPR applications. Our experimental results show the feasibility of our attacks, as they can be performed on a commodity desktop computer with an attack success rate of over 90%.


UnifySpeech: A Unified Framework for Zero-shot Text-to-Speech and Voice Conversion

arXiv.org Artificial Intelligence

Text-to-speech (TTS) and voice conversion (VC) are two different tasks that both aim to generate high-quality speech from different input modalities. Due to their similarity, this paper proposes UnifySpeech, which brings TTS and VC into a unified framework for the first time. The model is based on the assumption that speech can be decoupled into three independent components: content information, speaker information, and prosody information. Both TTS and VC can be regarded as mining these three parts of information from the input and completing the reconstruction of speech. For TTS, the speech content information is derived from the text, while in VC it is derived from the source speech, so the two tasks share all units except the speech content extraction module. We apply vector quantization and a domain constraint to bridge the gap between the content domains of TTS and VC. Objective and subjective evaluations show that by combining the two tasks, TTS obtains better speaker modeling ability while VC gains an impressive speech content decoupling capability.