Virtuoso: Massive Multilingual Speech-Text Joint Semi-Supervised Learning for Text-To-Speech
Takaaki Saeki, Heiga Zen, Zhehuai Chen, Nobuyuki Morioka, Gary Wang, Yu Zhang, Ankur Bapna, Andrew Rosenberg, Bhuvana Ramabhadran
This paper proposes Virtuoso, a massively multilingual speech-text joint semi-supervised learning framework for text-to-speech synthesis (TTS) models. Existing multilingual TTS typically supports tens of languages, which are a small fraction of the thousands of languages in the world. One difficulty in scaling multilingual TTS to hundreds of languages is collecting high-quality speech-text paired data in low-resource languages. This study extends Maestro, a speech-text joint pretraining framework for automatic speech recognition (ASR), to speech generation tasks. To train a TTS model from various types of speech and text data, different training schemes are designed to handle supervised (paired TTS and ASR data) and unsupervised (untranscribed speech and unspoken text) datasets.
arXiv.org Artificial Intelligence
Mar-15-2023
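
The abstract above describes separate training schemes for supervised (paired TTS and ASR) and unsupervised (untranscribed speech, unspoken text) data. A minimal sketch of how per-data-type loss terms could be mixed into one joint objective is given below; the function name, data-type keys, and weights are illustrative assumptions, not the paper's actual implementation.

```python
# A minimal sketch (not the paper's code) of combining loss terms from the
# four data types named in the abstract into one joint training objective.
# The data-type keys and the mixing weights are illustrative assumptions.

def total_joint_loss(per_kind_losses, weights=None):
    """Combine already-computed per-data-type losses into one scalar.

    per_kind_losses: mapping from data type to its loss value, e.g.
        {"paired_tts": ..., "paired_asr": ...,
         "untranscribed_speech": ..., "unspoken_text": ...}
    """
    default_weights = {
        "paired_tts": 1.0,            # supervised TTS (text -> speech) loss
        "paired_asr": 1.0,            # supervised speech-text (ASR) loss
        "untranscribed_speech": 0.5,  # self-supervised speech-only loss
        "unspoken_text": 0.5,         # text-only (unspoken text) loss
    }
    weights = weights or default_weights
    unknown = set(per_kind_losses) - set(weights)
    if unknown:
        raise ValueError(f"unknown data types: {unknown}")
    return sum(weights[k] * loss for k, loss in per_kind_losses.items())


# Example: one training step where each data type contributed a (dummy) loss.
print(total_joint_loss({"paired_tts": 2.3, "paired_asr": 1.7,
                        "untranscribed_speech": 0.9, "unspoken_text": 1.1}))
```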