 Optical Character Recognition


Mixed-Phoneme BERT: Improving BERT with Mixed Phoneme and Sup-Phoneme Representations for Text to Speech

arXiv.org Artificial Intelligence

Recently, leveraging BERT pre-training to improve the phoneme encoder in text to speech (TTS) has drawn increasing attention. However, existing works apply pre-training with character-based units to enhance the TTS phoneme encoder, which is inconsistent with TTS fine-tuning, where phonemes are the input. Pre-training with only phonemes as input can alleviate the input mismatch but lacks the ability to model rich representations and semantic information because of the limited phoneme vocabulary. In this paper, we propose Mixed-Phoneme BERT, a novel variant of the BERT model that uses mixed phoneme and sup-phoneme representations to enhance the learning capability. Specifically, we merge adjacent phonemes into sup-phonemes and combine the phoneme sequence and the merged sup-phoneme sequence as the model input, which enhances the model's capacity to learn rich contextual representations. Experimental results demonstrate that the proposed Mixed-Phoneme BERT significantly improves TTS performance, with a 0.30 CMOS gain over the FastSpeech 2 baseline. Mixed-Phoneme BERT achieves a 3x inference speedup and similar voice quality compared with the previous TTS pre-trained model PnG BERT.
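The merging step described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the sup-phoneme vocabulary `SUP_VOCAB` is hand-written and hypothetical (the abstract does not say how merges are obtained), and greedy longest-match merging is an assumption made for illustration.

```python
from typing import List, Set, Tuple

# Hypothetical sup-phoneme vocabulary: each entry is a run of adjacent phonemes.
SUP_VOCAB: Set[Tuple[str, ...]] = {
    ("HH", "AH"),
    ("L", "OW"),
    ("W", "ER", "L", "D"),
}
MAX_LEN = max(len(entry) for entry in SUP_VOCAB)


def merge_to_sup_phonemes(phonemes: List[str]) -> List[Tuple[str, ...]]:
    """Greedy longest-match merge of adjacent phonemes into sup-phonemes.

    A phoneme that starts no known sup-phoneme is kept as a length-1 unit.
    """
    merged: List[Tuple[str, ...]] = []
    i = 0
    while i < len(phonemes):
        for n in range(min(MAX_LEN, len(phonemes) - i), 0, -1):
            candidate = tuple(phonemes[i:i + n])
            if n == 1 or candidate in SUP_VOCAB:
                merged.append(candidate)
                i += n
                break
    return merged


def mixed_input(phonemes: List[str]) -> List[Tuple[str, str]]:
    """Pair every phoneme with the sup-phoneme covering it.

    Both sequences end up with the same length, so a model could, for
    example, embed each and add the embeddings (an assumption, not a
    detail given in the abstract).
    """
    pairs: List[Tuple[str, str]] = []
    for sup in merge_to_sup_phonemes(phonemes):
        name = "_".join(sup)
        pairs.extend((p, name) for p in sup)
    return pairs
```

Calling `mixed_input(["HH", "AH", "L", "OW"])` pairs each phoneme with its covering unit, so the phoneme-level and sup-phoneme-level views stay aligned position by position.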


Automation Artificial Intelligence Booms in Uncertain Economic

#artificialintelligence

As economic concerns increase, many companies begin to reduce their staff to control costs; 88 percent of job loss in routine occupations occurs within 12 months of a recession. While economic uncertainty continues, Veryfi has emerged as a trusted, reliable partner for companies seeking greater efficiency and stronger customer relationships, continuing its strong annual recurring revenue (ARR) growth. In the second quarter, Veryfi added over a dozen new logos and major accounts, including a top supplier of enterprise resource planning software and one of the world's largest CRM/Direct Marketing Network companies. "As companies seek new ways to increase efficiency and manage costs to position themselves for a challenging economy, Veryfi is leading the way, applying AI to automate routine data entry and streamline business processes," said Ernest Semerda, co-founder and CEO of Veryfi. "In Q2, we welcomed over a dozen new customers and multiple strategic accounts spanning key use cases from loyalty marketing to intelligent automation for accounts payable. We are seeing cross-market demand for our Veryfi technology."


ProDiff: Progressive Fast Diffusion Model For High-Quality Text-to-Speech

arXiv.org Artificial Intelligence

Denoising diffusion probabilistic models (DDPMs) have recently achieved leading performance in many generative tasks. However, the cost of their inherent iterative sampling process hinders their application to text-to-speech deployment. Through a preliminary study on diffusion model parameterization, we find that previous gradient-based TTS models require hundreds or thousands of iterations to guarantee high sample quality, which poses a challenge for accelerating sampling. In this work, we propose ProDiff, a progressive fast diffusion model for high-quality text-to-speech. Unlike previous work that estimates the gradient of the data density, ProDiff parameterizes the denoising model by directly predicting clean data, which avoids distinct quality degradation when sampling is accelerated. To tackle the model convergence challenge with decreased diffusion iterations, ProDiff reduces the data variance on the target side via knowledge distillation. Specifically, the denoising model uses the generated mel-spectrogram from an N-step DDIM teacher as the training target and distills the behavior into a new model with N/2 steps. As such, it allows the TTS model to make sharp predictions and further reduces the sampling time by orders of magnitude. Our evaluation demonstrates that ProDiff needs only 2 iterations to synthesize high-fidelity mel-spectrograms, while maintaining sample quality and diversity competitive with state-of-the-art models that use hundreds of steps. ProDiff enables a sampling speed 24x faster than real time on a single NVIDIA 2080Ti GPU, making diffusion models practically applicable to text-to-speech synthesis deployment for the first time. Our extensive ablation studies demonstrate that each design in ProDiff is effective, and we further show that ProDiff can be easily extended to the multi-speaker setting. Audio samples are available at https://ProDiff.github.io/.
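The step-halving arithmetic in the abstract can be made concrete with a small sketch. It only illustrates the bookkeeping: the function names are hypothetical, and applying the N to N/2 distillation repeatedly until a target step count is reached is an assumption, since the abstract does not state how many rounds are used.

```python
from typing import List


def distillation_schedule(teacher_steps: int, target_steps: int = 2) -> List[int]:
    """Step counts visited when each student uses half of its teacher's steps.

    Assumes repeated halving (integer division) until halving again would
    drop below `target_steps`.
    """
    if target_steps < 1 or teacher_steps < target_steps:
        raise ValueError("need teacher_steps >= target_steps >= 1")
    schedule = [teacher_steps]
    while schedule[-1] // 2 >= target_steps:
        schedule.append(schedule[-1] // 2)
    return schedule


def synthesis_seconds(audio_seconds: float, speedup_over_real_time: float) -> float:
    """Wall-clock time needed to synthesize audio at a given real-time speedup."""
    if speedup_over_real_time <= 0:
        raise ValueError("speedup must be positive")
    return audio_seconds / speedup_over_real_time
```

Under these assumptions, `distillation_schedule(8)` walks through 8, 4, and 2 steps, and `synthesis_seconds(12.0, 24.0)` shows that a 24x real-time speedup turns 12 seconds of audio into half a second of compute.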


AWS Amazon Polly – Text to Speech Converter

#artificialintelligence

Detailed and Comprehensive Documentation. Cloud Vendor Text to Speech Prices. Notes: Please note that for the script to work correctly, you need to have a valid AWS account. Latest Changes: 22.04.2022 - 2.0 - New: Full redesign with Laravel Framework - New: Powerful integrated Sound Studio - New: Mixing up to 20 voices in a single synthesize task


LIP: Lightweight Intelligent Preprocessor for meaningful text-to-speech

arXiv.org Artificial Intelligence

Existing Text-to-Speech (TTS) systems need to read everything from email messages, which may contain Personally Identifiable Information (PII), to text messages, which can contain streaks of emojis and punctuation. 92% of the world's online population uses emoji, with more than 10 billion emojis sent every day. The lack of a preprocessor leads to messages being read as-is, including punctuation and infographics such as emoticons. This problem worsens with continuous sequences of punctuation or emojis, which are quite common in real-world communications such as messaging and Social Networking Site (SNS) interactions. In this work, we introduce a lightweight intelligent preprocessor (LIP) that enhances the readability of a message before it is passed downstream to existing TTS systems. We propose multiple sub-modules, including expanding contractions, censoring swear words, and masking PII, as part of our preprocessor to enhance the readability of text. With a memory footprint of only 3.55 MB and an inference time of 4 ms for texts of up to 50 characters, our solution is suitable for real-time deployment. As this work is the first of its kind, we benchmark it with an open independent survey, which shows a 76.5% preference for the LIP-enabled TTS engine over standard TTS.
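A rule-based sketch can show what sub-modules of this kind might look like. It is not the LIP system itself: the contraction table, the swear-word list, the regular expressions, and the spoken placeholders ("email address", "phone number", "beep") are all hypothetical choices made for illustration, and collapsing repeated punctuation and emojis is an assumed way of handling the streaks the abstract mentions.

```python
import re

# Hypothetical lookup tables; a real system would need far larger ones.
CONTRACTIONS = {"can't": "cannot", "won't": "will not", "i'm": "i am", "it's": "it is"}
SWEAR_WORDS = {"darn"}

EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")
PHONE_RE = re.compile(r"\+?\d[\d\s-]{7,}\d")
CONTRACTION_RE = re.compile(r"\b\w+'\w+\b")
SWEAR_RE = re.compile(
    r"\b(?:" + "|".join(re.escape(w) for w in sorted(SWEAR_WORDS)) + r")\b",
    re.IGNORECASE,
)
REPEATED_PUNCT_RE = re.compile(r"([!?.,])\1+")
# Rough emoji ranges; only runs of the *same* emoji are collapsed.
REPEATED_EMOJI_RE = re.compile(r"([\U0001F300-\U0001FAFF\u2600-\u27BF])\1+")


def _expand(match: "re.Match[str]") -> str:
    """Expand a known contraction (result is lower-case); leave others as-is."""
    word = match.group(0)
    return CONTRACTIONS.get(word.lower(), word)


def preprocess(text: str) -> str:
    """Make a message easier for a downstream TTS engine to read aloud."""
    text = EMAIL_RE.sub("email address", text)      # mask PII: e-mail
    text = PHONE_RE.sub("phone number", text)       # mask PII: phone
    text = CONTRACTION_RE.sub(_expand, text)        # expand contractions
    text = SWEAR_RE.sub("beep", text)               # censor swear words
    text = REPEATED_PUNCT_RE.sub(r"\1", text)       # "!!!" -> "!"
    text = REPEATED_EMOJI_RE.sub(r"\1", text)       # collapse emoji streaks
    return text
```

PII masking runs first so that later rules never see the raw address or number; the remaining steps are independent of one another.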


This AI powered text-to-speech tool makes voiceovers sound true to life

PCWorld

The problem is that they often make videos sound robotic and lifeless, which is never good. Wish there was a better option? Then check out Speechnow, an AI-powered tool that makes video voiceovers sound true to life. Speechnow is a browser app that uses an AI algorithm to convert text into spoken word recordings. And it makes those recordings sound as if an actual human spoke them, so it's ideal for people who post a lot of videos to their socials.


Towards Multimodal Vision-Language Models Generating Non-Generic Text

arXiv.org Artificial Intelligence

Vision-language models can assess visual context in an image and generate descriptive text. While the generated text may be accurate and syntactically correct, it is often overly general. To address this, recent work has used optical character recognition to supplement visual information with text extracted from an image. In this work, we contend that vision-language models can benefit from additional information that can be extracted from an image but is not used by current models. We modify previous multimodal frameworks to accept relevant information from any number of auxiliary classifiers. In particular, we focus on person names as an additional set of tokens and create a novel image-caption dataset to facilitate captioning with person names. The dataset, Politicians and Athletes in Captions (PAC), consists of captioned images of well-known people in context. By fine-tuning pretrained models with this dataset, we demonstrate a model that can naturally integrate facial recognition tokens into generated text by training on limited data. For the PAC dataset, we provide a discussion on collection and baseline benchmark scores.


Bhasha Daan: A crowdsourcing initiative for Indian languages

#artificialintelligence

Bhasha Daan: A crowdsourcing initiative for Indian languages that will be as Indian as you and I. We invite you to contribute data to develop Speech Recognition, Text-to-Speech, Machine Translation and Optical Character Recognition for Indian languages.


Optical Character Recognition Technology for Business Owners

#artificialintelligence

Early versions of OCR had to be trained with images of each character and could only work with one font at a time. Modern machine learning algorithms make the text recognition process more advanced and provide a higher level of recognition accuracy for most fonts, regardless of input data formats. Advances in machine learning (ML) have given a new impetus to the development of OCR, significantly increasing the number of its applications. With enough training data, OCR machine learning algorithms can now be applied to any real-world scenario that requires identification and text transformation. Examples include receipt scanning, scanning printed text for conversion into synthetic speech, traffic sign recognition, and license plate recognition.


SANE-TTS: Stable And Natural End-to-End Multilingual Text-to-Speech

arXiv.org Artificial Intelligence

In this paper, we present SANE-TTS, a stable and natural end-to-end multilingual TTS model. Because a multilingual corpus for a given speaker is difficult to obtain, training a multilingual TTS model with monolingual corpora is unavoidable. We introduce a speaker regularization loss that improves speech naturalness during cross-lingual synthesis, together with domain adversarial training, which is also applied in other multilingual TTS models. Furthermore, in combination with the speaker regularization loss, replacing the speaker embedding with a zero vector in the duration predictor stabilizes cross-lingual inference. With this replacement, our model generates speech with moderate rhythm regardless of the source speaker in cross-lingual synthesis. In MOS evaluation, SANE-TTS achieves a naturalness score above 3.80 in both cross-lingual and intralingual synthesis, where the ground truth score is 3.99. SANE-TTS also maintains speaker similarity close to that of the ground truth, even in cross-lingual inference. Audio samples are available on our web page.
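The zero-vector replacement can be pictured with a minimal sketch. It is an illustration under stated assumptions, not the SANE-TTS code: the function name is hypothetical, and conditioning the duration predictor by adding the speaker embedding to each text frame is assumed, since the abstract does not say how the embedding is injected.

```python
from typing import List, Sequence


def condition_duration_input(
    hidden: Sequence[Sequence[float]],
    speaker_embedding: Sequence[float],
    zero_speaker: bool,
) -> List[List[float]]:
    """Build the duration predictor's input from text-encoder frames.

    With `zero_speaker=True` the speaker embedding is replaced by a zero
    vector, so predicted durations no longer depend on the speaker.
    Additive conditioning is an assumption made for this sketch.
    """
    dim = len(speaker_embedding)
    if any(len(frame) != dim for frame in hidden):
        raise ValueError("frame and speaker embedding dimensions must match")
    condition = [0.0] * dim if zero_speaker else list(speaker_embedding)
    return [[h + c for h, c in zip(frame, condition)] for frame in hidden]
```

With the zero vector, two different speakers yield identical duration-predictor inputs for the same text, which matches the abstract's description of rhythm that does not depend on the source speaker.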