Review for NeurIPS paper: A Spectral Energy Distance for Parallel Speech Synthesis
Neural Information Processing Systems
Additional Feedback: Comments:
- Section 2: Flow-based models are not necessarily large. WaveFlow, a recent state-of-the-art model, is a small-footprint flow-based model for raw audio. The authors could cite WaveFlow and correct the inaccurate claim in the related work section.
- I usually don't take FDSD measures seriously, as they generally cannot provide meaningful comparisons across different models, which the authors also observe.
- It would be very nice to see an ablation study with MOS scores that varies three design choices: 1) with or without the repulsive term, 2) single- or multi-scale spectrogram loss, 3) with or without the GAN loss. This would single out and emphasize the benefit of the repulsive term under different circumstances.
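To make the requested ablation concrete, the three binary design choices give 2 x 2 x 2 = 8 configurations. The following is a minimal sketch that enumerates them and pairs up the runs that differ only in one factor. The names `repulsive_term`, `multi_scale_spectrogram`, `gan_loss`, `ablation_grid`, and `pairs_differing_in` are hypothetical labels chosen for illustration, not identifiers from the paper or its code.

```python
from itertools import product

# Hypothetical labels for the three design choices named in the review.
DESIGN_CHOICES = {
    "repulsive_term": (True, False),
    "multi_scale_spectrogram": (True, False),
    "gan_loss": (True, False),
}


def ablation_grid(choices=DESIGN_CHOICES):
    """Return every on/off combination of the design choices as a list of dicts."""
    names = list(choices)
    return [dict(zip(names, values)) for values in product(*choices.values())]


def pairs_differing_in(factor, grid):
    """Pair each config that has `factor` on with the same config with it off.

    Comparing MOS within each pair isolates the effect of that one factor
    while the other two choices are held fixed.
    """
    pairs = []
    for cfg in grid:
        if cfg[factor]:
            counterpart = dict(cfg, **{factor: False})
            pairs.append((cfg, counterpart))
    return pairs


grid = ablation_grid()
for with_term, without_term in pairs_differing_in("repulsive_term", grid):
    print(with_term, "vs", without_term)
```

Each of the four printed pairs would correspond to one MOS comparison for the repulsive term, for example with and without the GAN loss, which is the "under different circumstances" part of the request.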
Jan-26-2025, 20:11:59 GMT