MelNet: A Generative Model for Audio in the Frequency Domain

arXiv.org Machine Learning

Capturing high-level structure in audio waveforms is challenging because a single second of audio spans tens of thousands of timesteps. While long-range dependencies are difficult to model directly in the time domain, we show that they can be more tractably modelled in two-dimensional time-frequency representations such as spectrograms. By leveraging this representational advantage, in conjunction with a highly expressive probabilistic model and a multiscale generation procedure, we design a model capable of generating high-fidelity audio samples which capture structure at timescales that time-domain models have yet to achieve. We apply our model to a variety of audio generation tasks, including unconditional speech generation, music generation, and text-to-speech synthesis, showing improvements over previous approaches in both density estimates and human judgments.
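
To make the time-domain versus time-frequency contrast concrete, here is a minimal sketch of a plain magnitude spectrogram. It is not the MelNet pipeline. The sample rate, test tone, frame length, hop size, and the `spectrogram` helper are all assumptions chosen for illustration, and the naive per-frame DFT is written for readability rather than speed.

```python
import cmath
import math


def spectrogram(samples, frame_len=256, hop=256):
    """Magnitude spectrogram via a naive per-frame DFT (illustrative, not fast)."""
    # Precompute the complex roots of unity used by the DFT.
    twiddle = [cmath.exp(-2j * math.pi * t / frame_len) for t in range(frame_len)]
    # A Hann window reduces spectral leakage at the frame edges.
    window = [0.5 - 0.5 * math.cos(2 * math.pi * n / frame_len)
              for n in range(frame_len)]
    n_bins = frame_len // 2 + 1  # keep non-negative frequencies only
    frames = []
    for start in range(0, len(samples) - frame_len + 1, hop):
        frame = [samples[start + n] * window[n] for n in range(frame_len)]
        mags = []
        for k in range(n_bins):
            acc = sum(frame[n] * twiddle[(k * n) % frame_len]
                      for n in range(frame_len))
            mags.append(abs(acc))
        frames.append(mags)
    return frames


# Assumed setup: one second of a 1 kHz sine sampled at 16 kHz.
SAMPLE_RATE = 16000
TONE_HZ = 1000.0
FRAME_LEN = 256

audio = [math.sin(2 * math.pi * TONE_HZ * n / SAMPLE_RATE)
         for n in range(SAMPLE_RATE)]
spec = spectrogram(audio, frame_len=FRAME_LEN, hop=FRAME_LEN)

# Locate the strongest frequency bin in the first frame.
peak_bin = max(range(len(spec[0])), key=lambda k: spec[0][k])
peak_hz = peak_bin * SAMPLE_RATE / FRAME_LEN

print(len(audio), "samples ->", len(spec), "frames x", len(spec[0]), "bins")
print("peak frequency:", peak_hz, "Hz")
```

Under these assumed settings, the 16,000-sample waveform becomes a grid of 62 time frames by 129 frequency bins, so a dependency that spans thousands of samples in the time domain spans only a handful of frames along the spectrogram's time axis.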


Amazon's Text-To-Speech AI Service Sounds More Natural And Realistic

#artificialintelligence

Amazon has enhanced Polly, its cloud-based text-to-speech service, to deliver natural and realistic speech synthesis. The service can now present domain-specific styles such as newscast and sportscast. Though text-to-speech has existed for more than two decades, it has never been used in mainstream media because it lacked natural and realistic modulation. Except for automated announcements that read out from existing datastores, the technology has never replaced the human voice. Thanks to advances in AI, text-to-speech has become natural and realistic enough that it may be hard to distinguish from a human voice.


AWS Polly gains neural voices in U.S. Spanish and Brazilian Portuguese

#artificialintelligence

Months after Amazon made Neural Text-To-Speech (NTTS) and the newscaster style generally available in Amazon Polly, a cloud service that converts text into speech, the Seattle company today debuted two new NTTS voices in U.S. Spanish and Brazilian Portuguese: "Lupe" and "Camila." Like the U.S. English NTTS voice before them, they mimic things like stress and intonation in speech by identifying tonal patterns. Neural versions of Camila and Lupe are available in Amazon Web Services' (AWS) U.S. East (N. Standard variants are also available across 18 AWS regions, bringing Polly's total number of voices to 61 across 29 languages and the total number of voices available in both standard and neural versions to 13 across four languages. According to Amazon text-to-speech program manager Marta Smolarek, the new U.S. Spanish voice, Lupe, which is the third U.S. text-to-speech voice in Polly, not only speaks Spanish but also handles English and provides a fully bilingual Spanish-English experience.


Google Cloud Platform launches text-to-speech service to compete with AWS Polly

ZDNet

Google Cloud outlined Cloud Text-to-Speech, a machine learning service that uses a model by Google's DeepMind subsidiary to analyze raw audio. With the move, developers will get more access to the natural-sounding text-to-speech technology used in Google Assistant, Search, Maps and others. According to Google, Cloud Text-to-Speech can be used to power call center voice response systems, enable Internet of Things devices to talk and convert text-based media into spoken formats. Google Cloud Text-to-Speech allows customers to choose from 32 different voices in 12 languages.


Google's Cloud Text-to-Speech gets more languages and voices - SiliconANGLE

#artificialintelligence

Google LLC today updated its Cloud Text-to-Speech service with new languages and voices in order to make it useful to more of its customers. Google Cloud Text-to-Speech is intended to help companies develop better conversational interfaces for the services they supply. It works by transforming written text into artificial speech that's spoken in realistic human voices. With the service, Google is targeting three main markets: voice response systems for call centers; "internet of things" products such as car infotainment systems, TVs and robots; and applications such as podcasts and audiobooks, which convert text into speech. In a blog post, Google product manager Dan Aharon said Cloud Text-to-Speech is getting 12 new languages or variants, including Czech, English (India), Filipino, Finnish, Greek, Hindi, Hungarian, Indonesian, Mandarin Chinese (China), Modern Standard Arabic and Vietnamese.