XPhoneBERT: A Pre-trained Multilingual Model for Phoneme Representations for Text-to-Speech
Linh The Nguyen, Thinh Pham, Dat Quoc Nguyen
We present XPhoneBERT, the first multilingual model pre-trained to learn phoneme representations for the downstream text-to-speech (TTS) task. XPhoneBERT has the same model architecture as BERT-base and is trained with the RoBERTa pre-training approach on 330M phoneme-level sentences from nearly 100 languages and locales. Experimental results show that employing XPhoneBERT as an input phoneme encoder significantly improves the naturalness and prosody of a strong neural TTS model and also helps produce fairly high-quality speech with limited training data. We publicly release our pre-trained XPhoneBERT in the hope that it will facilitate future research and downstream TTS applications for multiple languages. Our XPhoneBERT model is available at https://github.com/VinAIResearch/XPhoneBERT
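For illustration, below is a minimal sketch of using XPhoneBERT as a phoneme encoder, i.e., extracting per-phoneme representations that a TTS model could consume. It assumes the checkpoint is published on the Hugging Face Hub under the ID `vinai/xphonebert-base` and that the companion `text2phonemesequence` package (with a `Text2PhonemeSequence` class and an `infer_sentence` method) converts raw text into the phoneme sequence the model expects; these names are taken from the project's GitHub page, so consult the repository above for the authors' actual usage.

```python
# Sketch: encode a sentence into per-phoneme representations with XPhoneBERT.
# Assumptions (not stated in the abstract): Hub ID "vinai/xphonebert-base",
# and the "text2phonemesequence" package for text-to-phoneme conversion.
import torch
from transformers import AutoModel, AutoTokenizer
from text2phonemesequence import Text2PhonemeSequence

model = AutoModel.from_pretrained("vinai/xphonebert-base")
tokenizer = AutoTokenizer.from_pretrained("vinai/xphonebert-base")

# Convert raw text to a phoneme sequence (language code assumed: "eng-us").
text2phone = Text2PhonemeSequence(language="eng-us", is_cuda=False)
phonemes = text2phone.infer_sentence("Hello world .")

inputs = tokenizer(phonemes, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# Per-phoneme hidden states, e.g., to feed the encoder of a neural TTS model.
phoneme_embeddings = outputs.last_hidden_state
print(phoneme_embeddings.shape)  # (1, sequence_length, 768)
```

Since XPhoneBERT shares the BERT-base architecture, the resulting hidden states are 768-dimensional; in the downstream TTS setting described above, they would replace or initialize the phoneme embeddings of the TTS model's input encoder.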
arXiv.org Artificial Intelligence
May-31-2023
- Genre:
  - Research Report > New Finding (0.48)
- Technology:
  - Information Technology > Artificial Intelligence
    - Machine Learning (1.00)
    - Natural Language (1.00)
    - Speech > Speech Synthesis (0.74)
    - Vision > Optical Character Recognition (0.63)