Review for NeurIPS paper: A Spectral Energy Distance for Parallel Speech Synthesis

Neural Information Processing Systems 

This paper proposes a strategy for parallel TTS based on a spectral energy distance. It relies on neither explicit likelihood optimization nor adversarial learning, and as a result enjoys more stable and consistent training. On top of that, the authors introduce a repulsive term, which is shown to significantly improve the quality of the generated speech. When the method is combined with adversarial training, speech quality improves further. Overall, this is an interesting work, technically solid and experimentally compelling.
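
To make the role of the repulsive term concrete, here is a minimal sketch of an energy-distance-style loss computed on magnitude spectrograms. It is an illustration under stated assumptions, not the authors' implementation: the function names, the single fixed window size, the plain L1 spectrogram distance, and the naive DFT are all hypothetical simplifications, and the distance used in the paper may differ (for example in the spectrogram resolutions or the terms it combines).

```python
import cmath
import math


def magnitude_spectrum(frame):
    """Magnitude of the DFT of one frame (naive O(n^2) transform)."""
    n = len(frame)
    return [
        abs(sum(frame[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n)))
        for k in range(n // 2 + 1)
    ]


def spectrogram(signal, window, hop):
    """Magnitude spectra of overlapping frames (no window function applied)."""
    return [
        magnitude_spectrum(signal[start:start + window])
        for start in range(0, len(signal) - window + 1, hop)
    ]


def spectral_distance(a, b, window=8, hop=4):
    """L1 distance between the magnitude spectrograms of two waveforms.

    Assumption: one window size and a plain L1 norm, chosen for brevity.
    """
    spec_a = spectrogram(a, window, hop)
    spec_b = spectrogram(b, window, hop)
    return sum(
        abs(u - v)
        for frame_a, frame_b in zip(spec_a, spec_b)
        for u, v in zip(frame_a, frame_b)
    )


def energy_loss(real, sample_1, sample_2, repulsive=True):
    """Energy-distance-style loss from one real and two generated waveforms.

    The attractive terms pull both samples toward the real waveform; the
    repulsive term lowers the loss when the two samples differ from each other.
    """
    attract = spectral_distance(real, sample_1) + spectral_distance(real, sample_2)
    repel = spectral_distance(sample_1, sample_2) if repulsive else 0.0
    return attract - repel
```

In this sketch, `sample_1` and `sample_2` stand for two independent model outputs for the same conditioning input. Setting `repulsive=False` leaves only the attractive terms, which corresponds to the variant without the repulsive term discussed in the review.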