P-Flow: A Fast and Data-Efficient Zero-Shot TTS through Speech Prompting

Jan-20-2025, 01:30:38 GMT–Neural Information Processing Systems

While recent large-scale neural codec language models have shown significant improvement in zero-shot TTS by training on thousands of hours of data, they suffer from drawbacks such as a lack of robustness, slow sampling speed similar to previous autoregressive TTS methods, and reliance on pre-trained neural codec representations. Our work proposes P-Flow, a fast and data-efficient zero-shot TTS model that uses speech prompts for speaker adaptation. P-Flow comprises a speech-prompted text encoder for speaker adaptation and a flow matching generative decoder for high-quality and fast speech synthesis. Our speech-prompted text encoder uses speech prompts and text input to generate a speaker-conditional text representation. The flow matching generative decoder uses this speaker-conditional output to synthesize high-quality personalized speech significantly faster than real time.
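The abstract does not spell out the decoder's objective, so the following is only a minimal sketch of what a flow matching decoder typically involves, assuming the common straight-path conditional flow matching formulation. The function names (`cfm_pair`, `cfm_loss`, `euler_sample`), the `sigma_min` value, and the toy array shapes are hypothetical and not taken from P-Flow; the speech-prompted text encoder and its conditioning are omitted entirely.

```python
import numpy as np

def cfm_pair(x0, x1, t, sigma_min=1e-4):
    """Point on the straight noise-to-data path at time t, and its target velocity.

    x0: noise sample, x1: data sample (e.g. a stand-in for mel frames), t in [0, 1].
    Assumes the straight-path formulation; not confirmed as P-Flow's exact choice.
    """
    x_t = (1.0 - (1.0 - sigma_min) * t) * x0 + t * x1
    u_t = x1 - (1.0 - sigma_min) * x0  # constant along the path
    return x_t, u_t

def cfm_loss(predicted_velocity, u_t):
    """Mean squared error between a model's predicted velocity and the target."""
    return float(np.mean((predicted_velocity - u_t) ** 2))

def euler_sample(velocity_fn, x0, n_steps=10):
    """Integrate dx/dt = velocity_fn(x, t) from t=0 to t=1 with fixed Euler steps."""
    x = np.array(x0, dtype=float)
    dt = 1.0 / n_steps
    for k in range(n_steps):
        x = x + dt * velocity_fn(x, k * dt)
    return x

# Toy usage: random arrays stand in for noise and target features.
rng = np.random.default_rng(0)
x0 = rng.standard_normal((4, 80))   # hypothetical shape: 4 frames x 80 bins
x1 = rng.standard_normal((4, 80))
x_t, u_t = cfm_pair(x0, x1, t=0.3)

# An oracle that returns the target velocity recovers the end of the path.
sample = euler_sample(lambda x, t: u_t, x0, n_steps=10)
```

In a real system `velocity_fn` would be a trained network conditioned on the speaker-conditional text representation. A small number of Euler steps is one plausible reason such a decoder can sample quickly, though the step count here is purely illustrative.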

fast and data-efficient zero-shot tts, p-flow, speech prompting, (6 more...)

Technology:
- Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.90)