Voicebox: Text-Guided Multilingual Universal Speech Generation at Scale

Matthew Le

Neural Information Processing Systems 

In particular, Voicebox outperforms the state-of-the-art zero-shot TTS model VALL-E on both intelligibility (word error rate of 1.9% vs. 5.9% for VALL-E) and audio similarity (0.681 vs. 0.580 for VALL-E) while being up to 20 times faster.
