Noise-robust zero-shot text-to-speech synthesis conditioned on self-supervised speech-representation model with adapters
