We're headed for a revolution in computer-generated speech, and a voice clone of Microsoft co-founder Bill Gates demonstrates exactly why. In the clips embedded below, you can listen to what seems to be Gates reeling off a series of innocuous phrases. "A cramp is no small danger on a swim," he cautions. "Write a fond note to the friend you cherish," he advises. But each voice clip has been generated by a machine learning system named MelNet, designed and created by engineers at Facebook.
Capturing high-level structure in audio waveforms is challenging because a single second of audio spans tens of thousands of timesteps. While long-range dependencies are difficult to model directly in the time domain, we show that they can be more tractably modelled in two-dimensional time-frequency representations such as spectrograms. By leveraging this representational advantage, in conjunction with a highly expressive probabilistic model and a multiscale generation procedure, we design a model capable of generating high-fidelity audio samples which capture structure at timescales that time-domain models have yet to achieve. We apply our model to a variety of audio generation tasks, including unconditional speech generation, music generation, and text-to-speech synthesis---showing improvements over previous approaches in both density estimates and human judgments.
Amazon has enhanced Polly, its cloud-based text-to-speech service, to deliver more natural and realistic speech synthesis. The service can now speak in domain-specific styles such as newscast and sportscast. Though text-to-speech has existed for more than two decades, it was never used in mainstream media because it lacked natural, realistic modulation. Apart from automated announcements read out from existing datastores, the technology never replaced the human voice. Thanks to advances in AI, text-to-speech has become natural and realistic enough that it can be hard to distinguish from a human voice.
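As a rough illustration of how a caller would request the newscast style, the sketch below builds the parameters for a Polly `synthesize_speech` request, wrapping the input text in the SSML domain tag Polly uses for its newscaster style. The voice name and sample text are placeholders, and an actual call would go through boto3 with AWS credentials; this sketch only constructs the request.

```python
# Hypothetical sketch of a Polly newscaster-style request.
# "Joanna" and the sample text are placeholder assumptions; a real
# call would pass this dict to boto3's polly.synthesize_speech(**request).

def newscaster_ssml(text):
    """Wrap plain text in the SSML domain tag for Polly's
    newscaster speaking style (neural engine only)."""
    return f'<speak><amazon:domain name="news">{text}</amazon:domain></speak>'

def build_request(text, voice_id="Joanna"):
    return {
        "Engine": "neural",       # the newscaster style requires NTTS
        "TextType": "ssml",
        "Text": newscaster_ssml(text),
        "VoiceId": voice_id,
        "OutputFormat": "mp3",
    }

request = build_request("The markets closed higher today.")
print(request["Text"])
```

Keeping the SSML construction separate from the request makes it easy to swap in other speaking styles or plain text without touching the call site.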
The slow progress on realistic text-to-speech systems is not for lack of trying. Numerous teams have attempted to train deep-learning algorithms to reproduce real speech patterns using large databases of audio. The problem with this approach, say Vasquez and Lewis, is with the type of data. Until now, most work has focused on audio waveform recordings. These show how the amplitude of sound changes over time, with each second of recorded audio consisting of tens of thousands of time steps.
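The spectrogram representation the paper favors makes this scale difference concrete: a short-time Fourier transform collapses tens of thousands of waveform samples into a few dozen time-frequency frames. The sketch below, a minimal NumPy illustration (the frame length, hop size, and test tone are assumptions, not MelNet's actual settings), turns one second of 16 kHz audio into a magnitude spectrogram.

```python
import numpy as np

def spectrogram(waveform, frame_len=512, hop=256):
    """Magnitude spectrogram via a windowed short-time FFT.

    Each column summarizes frame_len samples, so a one-second clip at
    16 kHz (16,000 timesteps) collapses to only ~61 time frames.
    """
    window = np.hanning(frame_len)
    n_frames = 1 + (len(waveform) - frame_len) // hop
    frames = np.stack([
        waveform[i * hop : i * hop + frame_len] * window
        for i in range(n_frames)
    ])
    # rfft of a real frame of length 512 yields 257 frequency bins
    return np.abs(np.fft.rfft(frames, axis=1)).T  # (freq_bins, time_frames)

# One second of a 440 Hz tone sampled at 16 kHz
sr = 16_000
t = np.arange(sr) / sr
spec = spectrogram(np.sin(2 * np.pi * 440 * t))
print(spec.shape)  # (257, 61)
```

The 16,000-step waveform becomes a 257-by-61 grid, which is the kind of two-dimensional time-frequency representation in which long-range structure is easier to model.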
Months after Amazon launched Neural Text-to-Speech (NTTS) and the newscaster style in general availability for Amazon Polly, a cloud service that converts text into speech, the Seattle company today debuted two new NTTS voices in U.S. Spanish and Brazilian Portuguese: "Lupe" and "Camila." Like the U.S. English NTTS voices before them, they mimic things like stress and intonation in speech by identifying tonal patterns. Neural versions of Camila and Lupe are available in Amazon Web Services' (AWS) U.S. East (N. Virginia) region. Standard variants are also available across 18 AWS regions, bringing Polly's total number of voices to 61 across 29 languages and the total number of voices available in both standard and neural versions to 13 across four languages. According to Amazon text-to-speech program manager Marta Smolarek, the new U.S. Spanish voice, Lupe, the third U.S. text-to-speech voice in Polly, not only speaks Spanish but also handles English, providing a fully bilingual Spanish-English experience.