Review for NeurIPS paper: A Spectral Energy Distance for Parallel Speech Synthesis

Jan-26-2025, 20:11:59 GMT–Neural Information Processing Systems

Additional Feedback: Comments:
- Section 2: Flow-based models are not necessarily large. The new SOTA WaveFlow is a small-footprint flow-based model for raw audio. The authors may want to cite WaveFlow and correct the inaccurate claim in the related work section.
- I usually don't take FDSD measures seriously, as they generally cannot provide meaningful comparisons across different models, which the authors also observe.
- It would be very nice to see an ablation study with MOS scores that varies three design choices: 1) with or without the repulsive term, 2) single- or multi-scale spectrogram loss, 3) with or without the GAN loss. This would single out and emphasize the benefit of the repulsive term under different circumstances.
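To make the size of the requested ablation concrete, here is a minimal sketch that enumerates the full grid over the three design choices. The option names (`repulsive_term`, `spectrogram_loss`, `gan_loss`) are hypothetical labels chosen for illustration and are not taken from the paper or its code.

```python
from itertools import product

# Hypothetical labels for the three design choices listed in the review.
DESIGN_CHOICES = {
    "repulsive_term": (True, False),
    "spectrogram_loss": ("single-scale", "multi-scale"),
    "gan_loss": (True, False),
}


def ablation_grid(choices=DESIGN_CHOICES):
    """Return one config dict per combination of the design choices."""
    names = list(choices)
    value_lists = [choices[name] for name in names]
    return [dict(zip(names, values)) for values in product(*value_lists)]


def repulsive_pairs(configs):
    """Pair configs that differ only in whether the repulsive term is used."""
    with_term = [c for c in configs if c["repulsive_term"]]
    pairs = []
    for on in with_term:
        off = dict(on, repulsive_term=False)
        if off in configs:
            pairs.append((on, off))
    return pairs


if __name__ == "__main__":
    grid = ablation_grid()
    for index, config in enumerate(grid, start=1):
        print(index, config)
    print("matched pairs isolating the repulsive term:", len(repulsive_pairs(grid)))
```

The full grid has 2 x 2 x 2 = 8 configurations, which yields four matched pairs that differ only in the repulsive term. Comparing MOS within each pair would show whether the term helps under each combination of spectrogram loss and GAN loss.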

multi-scale spectrogram loss, parallel speech synthesis, repulsive term

Technology:
- Information Technology > Artificial Intelligence > Speech > Speech Synthesis (0.40)