Collaborating Authors

 generspeech




GenerSpeech: Towards Style Transfer for Generalizable Out-Of-Domain Text-to-Speech

Neural Information Processing Systems

Style transfer for out-of-domain (OOD) speech synthesis aims to generate speech samples with an unseen style (e.g., speaker identity, emotion, and prosody) derived from an acoustic reference, and faces two challenges: 1) the highly dynamic style features of expressive speech are difficult to model and transfer; and 2) TTS models must be robust enough to handle diverse OOD conditions that differ from the source data. This paper proposes GenerSpeech, a text-to-speech model for high-fidelity zero-shot style transfer of OOD custom voice. GenerSpeech decomposes speech variation into style-agnostic and style-specific parts by introducing two components: 1) a multi-level style adaptor that efficiently models a wide range of style conditions, including global speaker and emotion characteristics as well as local (utterance-, phoneme-, and word-level) fine-grained prosodic representations; and 2) a generalizable content adaptor with Mix-Style Layer Normalization that eliminates style information in the linguistic content representation and thus improves model generalization. Our evaluations on zero-shot style transfer demonstrate that GenerSpeech surpasses state-of-the-art models in audio quality and style similarity. Extension studies on adaptive style transfer further show that GenerSpeech performs robustly in the few-shot data setting. Audio samples are available at https://GenerSpeech.github.io/.
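
To make the Mix-Style Layer Normalization idea above concrete, the following PyTorch sketch conditions layer normalization on a global style vector and, during training, mixes style vectors across the batch so the content encoder cannot latch onto any single style identity. The module names, the Beta(alpha, alpha) mixing weight, and the dimensions are our own assumptions for illustration, not the released implementation.

# Minimal sketch of a Mix-Style Layer Normalization module (assumed design).
import torch
import torch.nn as nn

class MixStyleLayerNorm(nn.Module):
    def __init__(self, hidden_dim: int, style_dim: int, alpha: float = 0.2):
        super().__init__()
        self.norm = nn.LayerNorm(hidden_dim, elementwise_affine=False)
        # Predict a per-utterance scale and bias from the global style embedding.
        self.affine = nn.Linear(style_dim, 2 * hidden_dim)
        self.beta = torch.distributions.Beta(alpha, alpha)

    def forward(self, x: torch.Tensor, style: torch.Tensor) -> torch.Tensor:
        # x: (batch, time, hidden_dim); style: (batch, style_dim)
        if self.training:
            # Mix each utterance's style with a randomly permuted one so the
            # content representation cannot depend on a single style vector.
            lam = self.beta.sample((style.size(0), 1)).to(style.device)
            perm = torch.randperm(style.size(0), device=style.device)
            style = lam * style + (1.0 - lam) * style[perm]
        scale, bias = self.affine(style).chunk(2, dim=-1)
        return self.norm(x) * (1.0 + scale.unsqueeze(1)) + bias.unsqueeze(1)

At inference time the mixing branch is skipped, so the scale and bias come solely from the single reference style.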


Style Mixture of Experts for Expressive Text-To-Speech Synthesis

Jawaid, Ahad, Chandra, Shreeram Suresh, Lu, Junchen, Sisman, Berrak

arXiv.org Artificial Intelligence

Recent advances in style transfer text-to-speech (TTS) have improved the expressiveness of synthesized speech. Despite these advances, encoding stylistic information from diverse and unseen reference speech remains challenging. This paper introduces StyleMoE, an approach that divides the embedding space modeled by the style encoder into tractable subsets handled by style experts. The proposed method replaces the style encoder in a TTS system with a Mixture of Experts (MoE) layer: a gating network routes reference speech to different style experts, so each expert specializes in an aspect of the style space during optimization. Our experiments objectively and subjectively demonstrate that the proposed method increases the coverage of the style space for diverse and unseen styles. The approach can enhance existing state-of-the-art style transfer TTS models and, to our knowledge, marks the first study of MoE in style transfer TTS.
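
As a rough illustration of the gating idea, a minimal PyTorch sketch of swapping a single style encoder for an MoE layer might look as follows; the expert count, top-k routing, and module names are our own illustrative assumptions rather than the paper's implementation.

# Minimal sketch of an MoE-based style encoder (assumed design).
import torch
import torch.nn as nn

class StyleMoE(nn.Module):
    def __init__(self, ref_dim: int, style_dim: int, num_experts: int = 4, top_k: int = 2):
        super().__init__()
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(ref_dim, style_dim), nn.ReLU(), nn.Linear(style_dim, style_dim))
            for _ in range(num_experts)
        ])
        self.gate = nn.Linear(ref_dim, num_experts)
        self.top_k = top_k

    def forward(self, ref: torch.Tensor) -> torch.Tensor:
        # ref: (batch, ref_dim) pooled reference-speech features.
        scores = self.gate(ref)                                   # (batch, num_experts)
        topk_scores, topk_idx = scores.topk(self.top_k, dim=-1)   # route to the k best experts
        weights = torch.softmax(topk_scores, dim=-1)              # renormalize over selected experts
        expert_out = torch.stack([e(ref) for e in self.experts], dim=1)  # (batch, E, style_dim)
        picked = torch.gather(
            expert_out, 1, topk_idx.unsqueeze(-1).expand(-1, -1, expert_out.size(-1))
        )                                                          # (batch, k, style_dim)
        return (weights.unsqueeze(-1) * picked).sum(dim=1)        # (batch, style_dim)

Routing only the top-k experts keeps the cost close to that of a single style encoder while letting each expert specialize in a region of the style space.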


GenerSpeech: Towards Style Transfer for Generalizable Out-Of-Domain Text-to-Speech

Huang, Rongjie, Ren, Yi, Liu, Jinglin, Cui, Chenye, Zhao, Zhou

arXiv.org Artificial Intelligence

Style transfer for out-of-domain (OOD) speech synthesis aims to generate speech samples with an unseen style (e.g., speaker identity, emotion, and prosody) derived from an acoustic reference, and faces two challenges: 1) the highly dynamic style features of expressive speech are difficult to model and transfer; and 2) TTS models must be robust enough to handle diverse OOD conditions that differ from the source data. This paper proposes GenerSpeech, a text-to-speech model for high-fidelity zero-shot style transfer of OOD custom voice. GenerSpeech decomposes speech variation into style-agnostic and style-specific parts by introducing two components: 1) a multi-level style adaptor that efficiently models a wide range of style conditions, including global speaker and emotion characteristics as well as local (utterance-, phoneme-, and word-level) fine-grained prosodic representations; and 2) a generalizable content adaptor with Mix-Style Layer Normalization that eliminates style information in the linguistic content representation and thus improves model generalization. Our evaluations on zero-shot style transfer demonstrate that GenerSpeech surpasses state-of-the-art models in audio quality and style similarity. Extension studies on adaptive style transfer further show that GenerSpeech performs robustly in the few-shot data setting. Audio samples are available at https://GenerSpeech.github.io/.
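
For a rough picture of the multi-level style adaptor, the sketch below combines a global speaker/emotion embedding with utterance-, word-, and phoneme-level prosody extracted from the reference mel-spectrogram. The encoder choices, pooling, and alignment are placeholders of our own; the actual model aligns each local level to the target frames rather than mean-pooling.

# Minimal sketch of a multi-level style adaptor (assumed design).
import torch
import torch.nn as nn

class MultiLevelStyleAdaptor(nn.Module):
    def __init__(self, mel_dim: int = 80, hidden_dim: int = 256):
        super().__init__()
        # Global encoder for speaker/emotion characteristics.
        self.global_encoder = nn.GRU(mel_dim, hidden_dim, batch_first=True)
        # One local prosody encoder per granularity level.
        self.local_encoders = nn.ModuleDict({
            level: nn.Conv1d(mel_dim, hidden_dim, kernel_size=3, padding=1)
            for level in ("utterance", "word", "phoneme")
        })

    def forward(self, ref_mel: torch.Tensor, content: torch.Tensor) -> torch.Tensor:
        # ref_mel: (batch, ref_time, mel_dim); content: (batch, time, hidden_dim)
        _, global_style = self.global_encoder(ref_mel)            # (1, batch, hidden_dim)
        h = content + global_style.transpose(0, 1)                # broadcast over target time
        for encoder in self.local_encoders.values():
            local = encoder(ref_mel.transpose(1, 2)).transpose(1, 2)  # (batch, ref_time, hidden_dim)
            # Placeholder alignment: pool the local prosody over the reference and
            # broadcast it to every target frame.
            h = h + local.mean(dim=1, keepdim=True)
        return h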