FastSpeech: Fast, Robust and Controllable Text to Speech

Yi Ren, Yangjun Ruan, Xu Tan, Tao Qin, Sheng Zhao, Zhou Zhao, Tie-Yan Liu

Aug-20-2025, 09:32:38 GMT–Neural Information Processing Systems

Prominent methods (e.g., Tacotron 2) usually first generate a mel-spectrogram from text and then synthesize speech from the mel-spectrogram using a vocoder such as WaveNet. Compared with traditional concatenative and statistical parametric approaches, neural network based end-to-end models suffer from slow inference speed, and the synthesized speech is usually not robust (i.e., some words are skipped or repeated) and lacks controllability (voice speed or prosody control).
