A Spectral Energy Distance for Parallel Speech Synthesis

Oct-10-2024, 20:44:49 GMT–Neural Information Processing Systems

Speech synthesis is an important practical generative modeling problem that has seen great progress over the last few years, with likelihood-based autoregressive neural models now outperforming traditional concatenative systems. A downside of such autoregressive models is that they require executing tens of thousands of sequential operations per second of generated audio, making them ill-suited for deployment on specialized deep learning hardware. Here, we propose a new learning method that allows us to train highly parallel models of speech, without requiring access to an analytical likelihood function. Our approach is based on a generalized energy distance between the distributions of the generated and real audio. This spectral energy distance is a proper scoring rule with respect to the distribution over magnitude-spectrograms of the generated waveform audio and offers statistical consistency guarantees.
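
The abstract does not spell out the loss, so the following is only a minimal NumPy sketch of how a generalized energy distance over magnitude spectrograms could be estimated, not the authors' implementation. It assumes the standard energy-distance form 2 E[d(x, y)] - E[d(y, y')] - E[d(x, x')], drops the last term because it depends only on real data, and draws two independent generated samples y and y' per real example x. The function names, the single STFT resolution (`frame_len`, `hop`), and the plain Euclidean distance between spectrograms are all illustrative assumptions.

```python
import numpy as np

def magnitude_spectrogram(wave, frame_len=256, hop=128):
    """Magnitude STFT of a 1-D waveform with a Hann window (no padding).

    frame_len and hop are illustrative values, not taken from the paper.
    """
    n_frames = 1 + (len(wave) - frame_len) // hop
    window = np.hanning(frame_len)
    frames = np.stack([
        wave[i * hop : i * hop + frame_len] * window
        for i in range(n_frames)
    ])
    # rfft keeps the non-negative frequencies; abs() discards the phase.
    return np.abs(np.fft.rfft(frames, axis=-1))

def spectral_distance(a, b):
    """Euclidean distance between magnitude spectrograms (an assumed choice of d)."""
    return float(np.linalg.norm(magnitude_spectrogram(a) - magnitude_spectrogram(b)))

def energy_distance_loss(real, gen_a, gen_b):
    """Monte Carlo estimate of 2 E[d(x, y)] - E[d(y, y')].

    real, gen_a, gen_b: arrays of shape (batch, samples). gen_a and gen_b are
    assumed to be two independent model samples for the same conditioning
    input. The E[d(x, x')] term is omitted because it does not depend on
    the model.
    """
    terms = [
        spectral_distance(x, y1) + spectral_distance(x, y2) - spectral_distance(y1, y2)
        for x, y1, y2 in zip(real, gen_a, gen_b)
    ]
    return float(np.mean(terms))
```

In this sketch the attractive terms d(x, y) and d(x, y') pull generated audio toward the real recording, while the repulsive term d(y, y') rewards diversity between the two generated samples. Because only forward samples from the model are needed, no analytical likelihood is evaluated at any point.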

implicit generative model, parallel speech synthesis, spectral energy distance


