PortaSpeech: Portable and High-Quality Generative Text-to-Speech

Dec-24-2025, 07:34:19 GMT–Neural Information Processing Systems

Non-autoregressive text-to-speech (NAR-TTS) models such as FastSpeech 2 and Glow-TTS can synthesize high-quality speech from the given text in parallel. After analyzing two kinds of generative NAR-TTS models (VAE and normalizing flow), we find that: VAE is good at capturing the long-range semantics features (e.g., prosody) even with small model size but suffers from blurry and unnatural results; and normalizing flow is good at reconstructing the frequency bin-wise details but performs poorly when the number of model parameters is limited. Inspired by these observations, to generate diverse speech with natural details and rich prosody using a lightweight architecture, we propose PortaSpeech, a portable and high-quality generative text-to-speech model. Specifically, 1) to model both the prosody and mel-spectrogram details accurately, we adopt a lightweight VAE with an enhanced prior followed by a flow-based post-net with strong conditional inputs as the main architecture.

name change, portable and high-quality generative text-to-speech, portaspeech, (7 more...)

Neural Information Processing Systems

Dec-24-2025, 07:34:19 GMT

Conferences Web Page

Add feedback

Technology:
- Information Technology > Artificial Intelligence (1.00)