Voicebox: Text-Guided Multilingual Universal Speech Generation at Scale
Matthew Le
–Neural Information Processing Systems
In particular, Voicebox outperforms the state-of-the-art zero-shot TTS model VALL-E on both intelligibility (5.9% vs 1.9% word error rates) and audio similarity (0.580 vs 0.681) while being up to 20 times faster.
Nov-14-2025, 18:22:01 GMT
- Country:
- Asia > Middle East
- Israel > Jerusalem District > Jerusalem (0.04)
- Europe > United Kingdom
- North Sea > Southern North Sea (0.04)
- North America > Canada
- South America > Colombia
- Meta Department > Villavicencio (0.04)
- Genre:
- Research Report > New Finding (0.46)
- Technology:
- Information Technology > Artificial Intelligence
- Machine Learning > Neural Networks
- Deep Learning (0.67)
- Natural Language (1.00)
- Speech > Speech Recognition (1.00)
- Vision (1.00)