Voicebox: Text-Guided Multilingual Universal Speech Generation at Scale

Matthew Le, Bowen Shi, Brian Karrer

Neural Information Processing Systems 

Large-scale generative models such as GPT and DALL-E have revolutionized the research community. These models not only generate high-fidelity outputs but are also generalists that can solve tasks they were not explicitly taught. In contrast, speech generative models are still primitive in terms of scale and task generalization.