Towards zero-shot text-based voice editing using acoustic context conditioning, utterance embeddings, and reference encoders
