Virtuoso: Massive Multilingual Speech-Text Joint Semi-Supervised Learning for Text-To-Speech
Takaaki Saeki, Heiga Zen, Zhehuai Chen, Nobuyuki Morioka, Gary Wang, Yu Zhang, Ankur Bapna, Andrew Rosenberg, Bhuvana Ramabhadran
This paper proposes Virtuoso, a massively multilingual speech-text joint semi-supervised learning framework for text-to-speech synthesis (TTS) models. Existing multilingual TTS typically supports tens of languages, which are a small fraction of the thousands of languages in the world. One difficulty in scaling multilingual TTS to hundreds of languages is collecting high-quality speech-text paired data in low-resource languages. This study extends Maestro, a speech-text joint pretraining framework for automatic speech recognition (ASR), to speech generation tasks. To train a TTS model from various types of speech and text data, different training schemes are designed to handle supervised (paired TTS and ASR data) and unsupervised (untranscribed speech and unspoken text) datasets.
arXiv.org Artificial Intelligence
Mar-15-2023
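
The abstract above describes separate training schemes for supervised (paired TTS and ASR) and unsupervised (untranscribed speech, unspoken text) data. A minimal sketch of how per-data-type loss terms could be mixed into one joint objective is given below; the function name, data-type keys, and weights are illustrative assumptions, not the paper's actual implementation.

```python
# A minimal sketch (not the paper's code) of combining loss terms from the
# four data types named in the abstract into one joint training objective.
# The data-type keys and the mixing weights are illustrative assumptions.

def total_joint_loss(per_kind_losses, weights=None):
    """Combine already-computed per-data-type losses into one scalar.

    per_kind_losses: mapping from data type to its loss value, e.g.
        {"paired_tts": ..., "paired_asr": ...,
         "untranscribed_speech": ..., "unspoken_text": ...}
    """
    default_weights = {
        "paired_tts": 1.0,            # supervised TTS (text -> speech) loss
        "paired_asr": 1.0,            # supervised speech-text (ASR) loss
        "untranscribed_speech": 0.5,  # self-supervised speech-only loss
        "unspoken_text": 0.5,         # text-only (unspoken text) loss
    }
    weights = weights or default_weights
    unknown = set(per_kind_losses) - set(weights)
    if unknown:
        raise ValueError(f"unknown data types: {unknown}")
    return sum(weights[k] * loss for k, loss in per_kind_losses.items())


# Example: one training step where each data type contributed a (dummy) loss.
print(total_joint_loss({"paired_tts": 2.3, "paired_asr": 1.7,
                        "untranscribed_speech": 0.9, "unspoken_text": 1.1}))
```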