Speech Synthesis


Text-to-speech with feeling - this new AI model does everything but shed a tear

ZDNet

Not so long ago, generative AI could only communicate with human users via text. Now it's increasingly being given the power of speech -- and this ability is improving by the day. On Thursday, AI voice platform ElevenLabs introduced v3, described on the company's website as "the most expressive text-to-speech model ever." The new model can exhibit a wide range of emotions and subtle communicative quirks -- like sighs, laughter, and whispering -- making its speech more humanlike than the company's previous models.



Supplementary Material of HiFi-GAN: Generative Adversarial Networks for Efficient and High Fidelity Speech Synthesis

Neural Information Processing Systems

Details of the model architecture: The detailed architecture of the generator and MPD is depicted in Figure 4. The configuration of the three generator variants is listed in Table 5. In the ResBlock of V1 and V2, 2 convolution layers and 1 residual connection are stacked 3 times. In the ResBlock of V3, 1 convolution layer and 1 residual connection are stacked 2 times. Therefore, V3 consists of far fewer layers than V1 and V2. Periodic signal discrimination experiments: To verify the ability of MPD to discriminate periodic signals, we conducted additional experiments similar to training a discriminator on a simple dataset.
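The stacked "convolution + residual" structure described above can be sketched in a few lines. The PyTorch snippet below is an illustrative reconstruction, not the released HiFi-GAN code; the kernel sizes and dilation values are placeholder assumptions (the actual V1-V3 hyperparameters are the ones listed in Table 5).

```python
# Illustrative sketch of the two ResBlock variants; hyperparameters are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F


class ResBlockV1V2(nn.Module):
    """V1/V2-style block: (2 convolution layers + 1 residual connection) stacked 3 times."""

    def __init__(self, channels, kernel_size=3, dilations=(1, 3, 5)):
        super().__init__()
        self.convs1 = nn.ModuleList([
            nn.Conv1d(channels, channels, kernel_size,
                      dilation=d, padding=d * (kernel_size - 1) // 2)
            for d in dilations])
        self.convs2 = nn.ModuleList([
            nn.Conv1d(channels, channels, kernel_size,
                      padding=(kernel_size - 1) // 2)
            for _ in dilations])

    def forward(self, x):
        for c1, c2 in zip(self.convs1, self.convs2):
            y = c2(F.leaky_relu(c1(F.leaky_relu(x, 0.1)), 0.1))
            x = x + y  # residual connection after the two convolutions
        return x


class ResBlockV3(nn.Module):
    """V3-style block: (1 convolution layer + 1 residual connection) stacked 2 times."""

    def __init__(self, channels, kernel_size=3, dilations=(1, 3)):
        super().__init__()
        self.convs = nn.ModuleList([
            nn.Conv1d(channels, channels, kernel_size,
                      dilation=d, padding=d * (kernel_size - 1) // 2)
            for d in dilations])

    def forward(self, x):
        for c in self.convs:
            x = x + c(F.leaky_relu(x, 0.1))  # residual connection after one convolution
        return x
```

The V3 block halves both the number of convolutions per unit and the number of stacked units, which is why V3 ends up with far fewer layers than V1 and V2.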


Appendix A: A proper scoring rule for speech synthesis

Neural Information Processing Systems

A loss function or scoring rule L(q, x) measures how well a model distribution q fits data x drawn from a distribution p. Such a scoring rule is called proper if its expectation is minimized when q = p. If the minimum is also unique, the scoring rule is called strictly proper. In the large data limit, a strictly proper scoring rule can uniquely identify the distribution p, which means that it can be used as the basis of a statistically consistent learning method. This includes the special cases of L1 and L2 distance, the latter of which they show leads to a strictly proper scoring rule.
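In symbols, properness and strict properness as described above read as follows (a notational restatement of the definitions, nothing more):

```latex
% A scoring rule L is proper if the expected score under the data
% distribution p is minimized by reporting q = p:
\[
  \mathbb{E}_{x \sim p}\!\left[ L(p, x) \right]
  \;\le\;
  \mathbb{E}_{x \sim p}\!\left[ L(q, x) \right]
  \qquad \text{for every distribution } q,
\]
% and strictly proper if q = p is the unique minimizer of the right-hand side.
```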


GenerSpeech: Towards Style Transfer for Generalizable Out-Of-Domain Text-to-Speech

Neural Information Processing Systems

Style transfer for out-of-domain (OOD) speech synthesis aims to generate speech samples with an unseen style (e.g., speaker identity, emotion, and prosody) derived from an acoustic reference, while facing the following challenges: 1) the highly dynamic style features in expressive voice are difficult to model and transfer; and 2) the TTS models should be robust enough to handle diverse OOD conditions that differ from the source data. This paper proposes GenerSpeech, a text-to-speech model towards high-fidelity zero-shot style transfer of OOD custom voice. GenerSpeech decomposes the speech variation into style-agnostic and style-specific parts by introducing two components: 1) a multi-level style adaptor to efficiently model a large range of style conditions, including global speaker and emotion characteristics and local (utterance-, phoneme-, and word-level) fine-grained prosodic representations; and 2) a generalizable content adaptor with Mix-Style Layer Normalization to eliminate style information in the linguistic content representation and thus improve model generalization. Our evaluations on zero-shot style transfer demonstrate that GenerSpeech surpasses state-of-the-art models in terms of audio quality and style similarity. The extension studies on adaptive style transfer further show that GenerSpeech performs robustly in the few-shot data setting.
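For intuition about the Mix-Style Layer Normalization idea, the sketch below shows one plausible form: a conditional layer norm whose scale and bias come from a random convex mix of each utterance's style embedding and a shuffled one, so style cues in the content branch are perturbed during training. The class name, the Beta(0.1, 0.1) mixing prior, and the linear projections are assumptions made for illustration, not GenerSpeech's exact implementation.

```python
# Hedged sketch of a mix-style conditional layer norm; details are illustrative assumptions.
import torch
import torch.nn as nn


class MixStyleLayerNorm(nn.Module):
    def __init__(self, hidden_dim, style_dim, alpha=0.1):
        super().__init__()
        self.norm = nn.LayerNorm(hidden_dim, elementwise_affine=False)
        self.to_scale = nn.Linear(style_dim, hidden_dim)
        self.to_bias = nn.Linear(style_dim, hidden_dim)
        self.beta = torch.distributions.Beta(alpha, alpha)  # mixing-weight prior (assumed)

    def forward(self, x, style):
        # x: (batch, time, hidden_dim); style: (batch, style_dim)
        if self.training:
            lam = self.beta.sample((style.size(0), 1)).to(style.device)
            perm = torch.randperm(style.size(0), device=style.device)
            style = lam * style + (1.0 - lam) * style[perm]  # mix style embeddings across the batch
        scale = self.to_scale(style).unsqueeze(1)
        bias = self.to_bias(style).unsqueeze(1)
        return self.norm(x) * (1.0 + scale) + bias
```

Because the normalization's affine parameters never see a single, clean style vector during training, the content representation is discouraged from memorizing speaker- or emotion-specific statistics, which is the stated goal of the content adaptor.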


Transfer Learning from Speaker Verification to Multispeaker Text-To-Speech Synthesis

Neural Information Processing Systems

We describe a neural network-based system for text-to-speech (TTS) synthesis that is able to generate speech audio in the voice of different speakers, including those unseen during training. Our system consists of three independently trained components: (1) a speaker encoder network, trained on a speaker verification task using an independent dataset of noisy speech without transcripts from thousands of speakers, to generate a fixed-dimensional embedding vector from only seconds of reference speech from a target speaker; (2) a sequence-to-sequence synthesis network based on Tacotron 2 that generates a mel spectrogram from text, conditioned on the speaker embedding; (3) an auto-regressive WaveNet-based vocoder network that converts the mel spectrogram into time domain waveform samples. We demonstrate that the proposed model is able to transfer the knowledge of speaker variability learned by the discriminatively-trained speaker encoder to the multispeaker TTS task, and is able to synthesize natural speech from speakers unseen during training. We quantify the importance of training the speaker encoder on a large and diverse speaker set in order to obtain the best generalization performance. Finally, we show that randomly sampled speaker embeddings can be used to synthesize speech in the voice of novel speakers dissimilar from those used in training, indicating that the model has learned a high quality speaker representation.
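The three-stage pipeline in this abstract can be summarized as a small interface sketch. The class and method names below are illustrative placeholders, not the paper's API; the real components are the speaker-verification encoder, the Tacotron 2-based synthesizer, and the WaveNet vocoder described above.

```python
# Interface sketch of the three independently trained components at inference time.
import numpy as np


class SpeakerEncoder:
    """Speaker-verification-trained encoder: reference audio -> fixed-dim embedding."""
    def embed(self, reference_wav: np.ndarray) -> np.ndarray:
        raise NotImplementedError  # e.g., a fixed-dimensional d-vector (shape assumed)


class Synthesizer:
    """Tacotron 2-style sequence-to-sequence network conditioned on the speaker embedding."""
    def text_to_mel(self, text: str, speaker_embedding: np.ndarray) -> np.ndarray:
        raise NotImplementedError  # returns a mel spectrogram


class Vocoder:
    """Autoregressive WaveNet-style vocoder: mel spectrogram -> waveform samples."""
    def mel_to_wav(self, mel: np.ndarray) -> np.ndarray:
        raise NotImplementedError


def clone_voice(encoder: SpeakerEncoder, synthesizer: Synthesizer,
                vocoder: Vocoder, reference_wav: np.ndarray, text: str) -> np.ndarray:
    embedding = encoder.embed(reference_wav)        # (1) speaker encoder
    mel = synthesizer.text_to_mel(text, embedding)  # (2) spectrogram synthesis
    return vocoder.mel_to_wav(mel)                  # (3) neural vocoder
```

Because only the encoder sees the reference audio, swapping in a few seconds of speech from an unseen speaker changes the embedding and hence the synthesized voice, without retraining the synthesizer or vocoder.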



