AITopics | Speech Synthesis

Supplementary Material of HiFi-GAN: Generative Adversarial Networks for Efficient and High Fidelity Speech Synthesis

Neural Information Processing SystemsMay-31-2025, 19:13:53 GMT

Details of the Model Architecture The detailed architecture of the generator and MPD is depicted in Figure 4. The configuration of three variants of the generator is listed in Table 5. In the ResBlock of V1 and V2, 2 convolution layers and 1 residual connection are stacked 3 times. In the Resblock of V3, 1 convolution layer and 1 residual connection are stacked 2 times. Therefore, V3 consists of a much smaller number of layers than V1 and V2. Periodic signal discrimination experiments We conducted additional experiments similar to training a discriminator using a simple dataset to verify the ability of MPD to discriminate periodic signals.

artificial intelligence, generator, machine learning, (16 more...)

Neural Information Processing Systems

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.51)
Information Technology > Artificial Intelligence > Speech > Speech Synthesis (0.41)
Information Technology > Artificial Intelligence > Machine Learning > Unsupervised or Indirectly Supervised Learning (0.41)

Add feedback

GenerSpeech: Towards Style Transfer for Generalizable Out-Of-Domain Text-to-Speech

Neural Information Processing SystemsMay-29-2025, 15:47:46 GMT

Style transfer for out-of-domain (OOD) speech synthesis aims to generate speech samples with unseen style (e.g., speaker identity, emotion, and prosody) derived from an acoustic reference, while facing the following challenges: 1) The highly dynamic style features in expressive voice are difficult to model and transfer; and 2) the TTS models should be robust enough to handle diverse OOD conditions that differ from the source data. This paper proposes GenerSpeech, a text-tospeech model towards high-fidelity zero-shot style transfer of OOD custom voice. GenerSpeech decomposes the speech variation into the style-agnostic and stylespecific parts by introducing two components: 1) a multi-level style adaptor to efficiently model a large range of style conditions, including global speaker and emotion characteristics, and the local (utterance, phoneme, and word-level) finegrained prosodic representations; and 2) a generalizable content adaptor with Mix-Style Layer Normalization to eliminate style information in the linguistic content representation and thus improve model generalization. Our evaluations on zero-shot style transfer demonstrate that GenerSpeech surpasses the state-of-the-art models in terms of audio quality and style similarity. The extension studies to adaptive style transfer further show that GenerSpeech performs robustly in the few-shot data setting.

large language model, machine learning, natural language, (17 more...)

Neural Information Processing Systems

Genre: Research Report > Promising Solution (0.34)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.93)
Information Technology > Artificial Intelligence > Speech > Speech Synthesis (0.77)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.69)

Add feedback

Transfer Learning from Speaker Verification to Multispeaker Text-To-Speech Synthesis

Ye Jia, Yu Zhang, Ron Weiss, Quan Wang, Jonathan Shen, Fei Ren, zhifeng Chen, Patrick Nguyen, Ruoming Pang, Ignacio Lopez Moreno, Yonghui Wu

Neural Information Processing SystemsMay-26-2025, 07:04:15 GMT

We describe a neural network-based system for text-to-speech (TTS) synthesis that is able to generate speech audio in the voice of different speakers, including those unseen during training. Our system consists of three independently trained components: (1) a speaker encoder network, trained on a speaker verification task using an independent dataset of noisy speech without transcripts from thousands of speakers, to generate a fixed-dimensional embedding vector from only seconds of reference speech from a target speaker; (2) a sequence-to-sequence synthesis network based on Tacotron 2 that generates a mel spectrogram from text, conditioned on the speaker embedding; (3) an auto-regressive WaveNet-based vocoder network that converts the mel spectrogram into time domain waveform samples. We demonstrate that the proposed model is able to transfer the knowledge of speaker variability learned by the discriminatively-trained speaker encoder to the multispeaker TTS task, and is able to synthesize natural speech from speakers unseen during training. We quantify the importance of training the speaker encoder on a large and diverse speaker set in order to obtain the best generalization performance. Finally, we show that randomly sampled speaker embeddings can be used to synthesize speech in the voice of novel speakers dissimilar from those used in training, indicating that the model has learned a high quality speaker representation.

artificial intelligence, machine learning, speech, (19 more...)

Neural Information Processing Systems

Country: North America > Canada (0.14)

Industry: Information Technology (0.46)

Technology:

Information Technology > Artificial Intelligence > Speech > Speech Synthesis (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)
Information Technology > Artificial Intelligence > Speech > Acoustic Processing (0.86)

Add feedback

SupertonicTTS: Towards Highly Scalable and Efficient Text-to-Speech System

Kim, Hyeongju, Yang, Jinhyeok, Yu, Yechan, Ji, Seunghun, Morton, Jacob, Bous, Frederik, Byun, Joon, Lee, Juheon

arXiv.org Artificial IntelligenceMar-29-2025

We present a novel text-to-speech (TTS) system, namely SupertonicTTS, for improved scalability and efficiency in speech synthesis. SupertonicTTS is comprised of three components: a speech autoencoder for continuous latent representation, a text-to-latent module leveraging flow-matching for text-to-latent mapping, and an utterance-level duration predictor. To enable a lightweight architecture, we employ a low-dimensional latent space, temporal compression of latents, and ConvNeXt blocks. We further simplify the TTS pipeline by operating directly on raw character-level text and employing cross-attention for text-speech alignment, thus eliminating the need for grapheme-to-phoneme (G2P) modules and external aligners. In addition, we introduce context-sharing batch expansion that accelerates loss convergence and stabilizes text-speech alignment. Experimental results demonstrate that SupertonicTTS achieves competitive performance while significantly reducing architectural complexity and computational overhead compared to contemporary TTS models. Audio samples demonstrating the capabilities of SupertonicTTS are available at: https://supertonictts.github.io/.

artificial intelligence, machine learning, representation, (14 more...)

arXiv.org Artificial Intelligence

2503.23108

Genre: Research Report > New Finding (0.66)

Technology:

Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Speech > Speech Synthesis (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)
Information Technology > Artificial Intelligence > Speech > Speech Recognition (0.93)

Add feedback

Transfer Learning from Speaker Verification to Multispeaker Text-To-Speech Synthesis

Ye Jia, Yu Zhang, Ron Weiss, Quan Wang, Jonathan Shen, Fei Ren, zhifeng Chen, Patrick Nguyen, Ruoming Pang, Ignacio Lopez Moreno, Yonghui Wu

Neural Information Processing SystemsMar-26-2025, 05:38:45 GMT

We describe a neural network-based system for text-to-speech (TTS) synthesis that is able to generate speech audio in the voice of different speakers, including those unseen during training. Our system consists of three independently trained components: (1) a speaker encoder network, trained on a speaker verification task using an independent dataset of noisy speech without transcripts from thousands of speakers, to generate a fixed-dimensional embedding vector from only seconds of reference speech from a target speaker; (2) a sequence-to-sequence synthesis network based on Tacotron 2 that generates a mel spectrogram from text, conditioned on the speaker embedding; (3) an auto-regressive WaveNet-based vocoder network that converts the mel spectrogram into time domain waveform samples. We demonstrate that the proposed model is able to transfer the knowledge of speaker variability learned by the discriminatively-trained speaker encoder to the multispeaker TTS task, and is able to synthesize natural speech from speakers unseen during training. We quantify the importance of training the speaker encoder on a large and diverse speaker set in order to obtain the best generalization performance. Finally, we show that randomly sampled speaker embeddings can be used to synthesize speech in the voice of novel speakers dissimilar from those used in training, indicating that the model has learned a high quality speaker representation.

artificial intelligence, machine learning, speech, (18 more...)

Neural Information Processing Systems

Country: North America (0.28)

Industry: Information Technology (0.46)

Technology:

Information Technology > Artificial Intelligence > Speech > Speech Synthesis (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)
Information Technology > Artificial Intelligence > Speech > Acoustic Processing (0.86)

Add feedback

Dict-TTS: Learning to Pronounce with Prior Dictionary Knowledge for Text-to-Speech

Neural Information Processing SystemsMar-21-2025, 17:22:44 GMT

Polyphone disambiguation aims to capture accurate pronunciation knowledge from natural text sequences for reliable Text-to-speech (TTS) systems. However, previous approaches require substantial annotated training data and additional efforts from language experts, making it difficult to extend high-quality neural TTS systems to out-of-domain daily conversations and countless languages worldwide. This paper tackles the polyphone disambiguation problem from a concise and novel perspective: we propose Dict-TTS, a semantic-aware generative text-to-speech model with an online website dictionary (the existing prior information in the natural language). Specifically, we design a semantics-to-pronunciation attention (S2PA) module to match the semantic patterns between the input text sequence and the prior semantics in the dictionary and obtain the corresponding pronunciations; The S2PA module can be easily trained with the end-to-end TTS model without any annotated phoneme labels. Experimental results in three languages show that our model outperforms several strong baseline models in terms of pronunciation accuracy and improves the prosody modeling of TTS systems. Further extensive analyses demonstrate that each design in Dict-TTS is effective.

machine learning, natural language, pronunciation, (20 more...)

Neural Information Processing Systems

Country:

Europe (0.93)
North America > United States (0.46)

Technology:

Information Technology > Artificial Intelligence > Speech > Speech Synthesis (0.93)
Information Technology > Artificial Intelligence > Vision > Optical Character Recognition (0.92)
Information Technology > Artificial Intelligence > Natural Language > Text Processing (0.69)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.46)

Add feedback

StyleTTS 2: Towards Human-Level Text-to-Speech through Style Diffusion and Adversarial Training with Large Speech Language Models

Neural Information Processing SystemsMar-21-2025, 16:54:54 GMT

In this paper, we present StyleTTS 2, a text-to-speech (TTS) model that leverages style diffusion and adversarial training with large speech language models (SLMs) to achieve human-level TTS synthesis. StyleTTS 2 differs from its predecessor by modeling styles as a latent random variable through diffusion models to generate the most suitable style for the text without requiring reference speech, achieving efficient latent diffusion while benefiting from the diverse speech synthesis offered by diffusion models. Furthermore, we employ large pre-trained SLMs, such as WavLM, as discriminators with our novel differentiable duration modeling for endto-end training, resulting in improved speech naturalness. StyleTTS 2 surpasses human recordings on the single-speaker LJSpeech dataset and matches it on the multispeaker VCTK dataset as judged by native English speakers. Moreover, when trained on the LibriTTS dataset, our model outperforms previous publicly available models for zero-shot speaker adaptation. This work achieves the first human-level TTS on both single and multispeaker datasets, showcasing the potential of style diffusion and adversarial training with large SLMs. The audio demos and source code are available at https://styletts2.github.io/.

large language model, machine learning, natural language, (18 more...)

Neural Information Processing Systems

Country:

North America > United States (0.14)
Europe (0.14)

Genre:

Research Report > New Finding (1.00)
Research Report > Experimental Study (0.68)

Technology:

Information Technology > Artificial Intelligence > Speech > Speech Synthesis (1.00)
Information Technology > Artificial Intelligence > Speech > Speech Recognition (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
(2 more...)

Add feedback

GenerSpeech: Towards Style Transfer for Generalizable Out-Of-Domain Text-to-Speech

Neural Information Processing SystemsMar-21-2025, 00:02:06 GMT

Style transfer for out-of-domain (OOD) speech synthesis aims to generate speech samples with unseen style (e.g., speaker identity, emotion, and prosody) derived from an acoustic reference, while facing the following challenges: 1) The highly dynamic style features in expressive voice are difficult to model and transfer; and 2) the TTS models should be robust enough to handle diverse OOD conditions that differ from the source data. This paper proposes GenerSpeech, a text-tospeech model towards high-fidelity zero-shot style transfer of OOD custom voice. GenerSpeech decomposes the speech variation into the style-agnostic and stylespecific parts by introducing two components: 1) a multi-level style adaptor to efficiently model a large range of style conditions, including global speaker and emotion characteristics, and the local (utterance, phoneme, and word-level) finegrained prosodic representations; and 2) a generalizable content adaptor with Mix-Style Layer Normalization to eliminate style information in the linguistic content representation and thus improve model generalization. Our evaluations on zero-shot style transfer demonstrate that GenerSpeech surpasses the state-of-the-art models in terms of audio quality and style similarity. The extension studies to adaptive style transfer further show that GenerSpeech performs robustly in the few-shot data setting.

large language model, machine learning, natural language, (16 more...)

Neural Information Processing Systems

Genre: Research Report > Promising Solution (0.34)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.93)
Information Technology > Artificial Intelligence > Speech > Speech Synthesis (0.77)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.69)

Add feedback

Appendix A A proper scoring rule for speech synthesis

Neural Information Processing SystemsMar-19-2025, 22:05:59 GMT

A loss function or scoring rule L(q, x) measures how well a model distribution q fits data x drawn from a distribution p. Such a scoring rule is called proper if its expectation is minimized when q = p. If the minimum is also unique, the scoring rule is called strictly proper. In the large data limit, a strictly proper scoring rule can uniquely identify the distribution p, which means that it can be used as the basis of a statistically consistent learning method. This includes the special cases of L1 and L2 distance, the latter of which they show leads to a strictly proper scoring rule.

artificial intelligence, implementation, machine learning, (17 more...)

Neural Information Processing Systems

Industry: Leisure & Entertainment > Games (1.00)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.46)
Information Technology > Artificial Intelligence > Speech > Speech Synthesis (0.40)

Add feedback

Supplementary Material of Glow-TTS: A Generative Flow for Text-to-Speech via Monotonic Alignment Search

Neural Information Processing SystemsMar-19-2025, 03:36:30 GMT

Details of the Model Architecture The detailed encoder architecture is depicted in Figure 7. Some implementation details that we use in the decoder, and the decoder architecture are depicted in Figure 8. We design the grouped 1x1 convolutions to be able to mix channels. For each group, the same number of channels are extracted from one half of the feature map separated by coupling layers and the other half, respectively. Figure 8c shows an example.

artificial intelligence, glow-tts, optical character recognition, (17 more...)

Neural Information Processing Systems

Technology:

Information Technology > Artificial Intelligence > Vision > Optical Character Recognition (0.41)
Information Technology > Artificial Intelligence > Speech > Speech Synthesis (0.41)
Information Technology > Artificial Intelligence > Assistive Technologies (0.41)

Add feedback

Filters

Collaborating Authors

Speech Synthesis

Supplementary Material of HiFi-GAN: Generative Adversarial Networks for Efficient and High Fidelity Speech Synthesis

GenerSpeech: Towards Style Transfer for Generalizable Out-Of-Domain Text-to-Speech

Transfer Learning from Speaker Verification to Multispeaker Text-To-Speech Synthesis

SupertonicTTS: Towards Highly Scalable and Efficient Text-to-Speech System

Transfer Learning from Speaker Verification to Multispeaker Text-To-Speech Synthesis

Dict-TTS: Learning to Pronounce with Prior Dictionary Knowledge for Text-to-Speech

StyleTTS 2: Towards Human-Level Text-to-Speech through Style Diffusion and Adversarial Training with Large Speech Language Models

GenerSpeech: Towards Style Transfer for Generalizable Out-Of-Domain Text-to-Speech

Appendix A A proper scoring rule for speech synthesis

Supplementary Material of Glow-TTS: A Generative Flow for Text-to-Speech via Monotonic Alignment Search