AITopics | p-flow

P-Flow: AFast and Data-Efficient Zero-Shot TTS through Speech Prompting

Neural Information Processing SystemsApr-30-2026, 04:34:00 GMT

While recent large-scale neural codec language models have shown significant improvement in zero-shot TTS by training on thousands of hours of data, they suffer from drawbacks such as a lack of robustness, slow sampling speed similar to previous autoregressive TTS methods, and reliance on pre-trained neural codec representations. Our work proposes P-Flow, a fast and data-efficient zero-shot TTS model that uses speech prompts for speaker adaptation. P-Flow comprises a speechprompted text encoder for speaker adaptation and a flow matching generative decoder for high-quality and fast speech synthesis. Our speech-prompted text encoder uses speech prompts and text input to generate speaker-conditional text representation. The flow matching generative decoder uses the speaker-conditional output to synthesize high-quality personalized speech significantly faster than in real-time. Unlike the neural codec language models, we specifically train P-Flow on LibriTTS dataset using a continuous mel-representation. Through our training method using continuous speech prompts, P-Flow matches the speaker similarity performance of the large-scale zero-shot TTS models with two orders of magnitude less training data and has more than 20 faster sampling speed. Our results show that P-Flow has better pronunciation and is preferred in human likeness and speaker similarity to its recent state-of-the-art counterparts, thus defining P-Flow as an attractive and desirable alternative. We provide audio samples on our demo page.

large language model, machine learning, natural language, (17 more...)

Neural Information Processing Systems

Genre: Research Report > New Finding (0.68)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.68)

Add feedback

Supplementary Material for P-Flow

Neural Information Processing SystemsFeb-17-2026, 18:42:28 GMT

The link to our demo page is https://bit.ly/3ID5Zam. We present the objective metrics according to the Euler steps in the result section of the main paper. We measure the acoustic quality using 5-scale Mean Opinion Scores (MOS).

artificial intelligence, natural language, representation, (15 more...)

Neural Information Processing Systems

Technology: Information Technology > Artificial Intelligence > Natural Language (0.32)

Add feedback

P-Flow: A Fast and Data-Efficient Zero-Shot TTS through Speech Prompting Sungwon Kim 1,2, Kevin J Shih

Neural Information Processing SystemsFeb-17-2026, 18:42:25 GMT

Our work proposes P-Flow, a fast and data-efficient zero-shot TTS model that uses speech prompts for speaker adaptation. P-Flow comprises a speech-prompted text encoder for speaker adaptation and a flow matching generative decoder for high-quality and fast speech synthesis.

large language model, machine learning, natural language, (18 more...)

Neural Information Processing Systems

Country:

Asia > South Korea > Seoul > Seoul (0.04)
Europe > Italy > Calabria > Catanzaro Province > Catanzaro (0.04)

Technology:

Information Technology > Artificial Intelligence > Speech (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.68)

Add feedback

P-Flow: A Fast and Data-Efficient Zero-Shot TTS through Speech Prompting

Neural Information Processing SystemsDec-27-2025, 03:03:43 GMT

While recent large-scale neural codec language models have shown significant improvement in zero-shot TTS by training on thousands of hours of data, they suffer from drawbacks such as a lack of robustness, slow sampling speed similar to previous autoregressive TTS methods, and reliance on pre-trained neural codec representations. Our work proposes P-Flow, a fast and data-efficient zero-shot TTS model that uses speech prompts for speaker adaptation. P-Flow comprises a speech-prompted text encoder for speaker adaptation and a flow matching generative decoder for high-quality and fast speech synthesis. Our speech-prompted text encoder uses speech prompts and text input to generate speaker-conditional text representation. The flow matching generative decoder uses the speaker-conditional output to synthesize high-quality personalized speech significantly faster than in real-time. Unlike the neural codec language models, we specifically train P-Flow on LibriTTS dataset using a continuous mel-representation. Through our training method using continuous speech prompts, P-Flow matches the speaker similarity performance of the large-scale zero-shot TTS models with two orders of magnitude less training data and has more than 20$\times$ faster sampling speed. Our results show that P-Flow has better pronunciation and is preferred in human likeness and speaker similarity to its recent state-of-the-art counterparts, thus defining P-Flow as an attractive and desirable alternative.

fast and data-efficient zero-shot tts, name change, p-flow, (8 more...)

Neural Information Processing Systems

Country: Europe > Italy > Calabria > Catanzaro Province > Catanzaro (0.07)

Genre: Research Report > New Finding (0.59)

Technology: Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.74)

Add feedback

Supplementary Material for P-Flow

Neural Information Processing SystemsOct-9-2025, 10:48:10 GMT

The link to our demo page is https://bit.ly/3ID5Zam. We present the objective metrics according to the Euler steps in the result section of the main paper. We measure the acoustic quality using 5-scale Mean Opinion Scores (MOS).

artificial intelligence, natural language, representation, (15 more...)

Neural Information Processing Systems

Technology: Information Technology > Artificial Intelligence > Natural Language (0.32)

Add feedback

P-Flow: A Fast and Data-Efficient Zero-Shot TTS through Speech Prompting

Neural Information Processing SystemsJan-20-2025, 01:30:38 GMT

While recent large-scale neural codec language models have shown significant improvement in zero-shot TTS by training on thousands of hours of data, they suffer from drawbacks such as a lack of robustness, slow sampling speed similar to previous autoregressive TTS methods, and reliance on pre-trained neural codec representations. Our work proposes P-Flow, a fast and data-efficient zero-shot TTS model that uses speech prompts for speaker adaptation. P-Flow comprises a speech-prompted text encoder for speaker adaptation and a flow matching generative decoder for high-quality and fast speech synthesis. Our speech-prompted text encoder uses speech prompts and text input to generate speaker-conditional text representation. The flow matching generative decoder uses the speaker-conditional output to synthesize high-quality personalized speech significantly faster than in real-time.

fast and data-efficient zero-shot tts, p-flow, speech prompting, (6 more...)

Neural Information Processing Systems

Country: Europe > Italy > Calabria > Catanzaro Province > Catanzaro (0.07)

Technology: Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.90)

Add feedback

Scaling NVIDIA's Multi-speaker Multi-lingual TTS Systems with Zero-Shot TTS to Indic Languages

Arora, Akshit, Badlani, Rohan, Kim, Sungwon, Valle, Rafael, Catanzaro, Bryan

arXiv.org Artificial IntelligenceJan-29-2024

In this paper, we describe the TTS models developed by NVIDIA for the MMITS-VC (Multi-speaker, Multi-lingual Indic TTS with Voice Cloning) 2024 Challenge. In Tracks 1 and 2, we utilize RAD-MMM to perform few-shot TTS by training additionally on 5 minutes of target speaker data. In Track 3, we utilize P-Flow to perform zero-shot TTS by training on the challenge dataset as well as external datasets. We use HiFi-GAN vocoders for all submissions. RAD-MMM performs competitively on Tracks 1 and 2, while P-Flow ranks first on Track 3, with mean opinion score (MOS) 4.4 and speaker similarity score (SMOS) of 3.62.

challenge dataset, dataset, zero-shot tts, (15 more...)

arXiv.org Artificial Intelligence

2401.13851

Country: Europe > Italy > Calabria > Catanzaro Province > Catanzaro (0.06)

Genre: Research Report (0.40)

Industry: Information Technology > Hardware (0.63)

Technology: