Phoneme-Level BERT for Enhanced Prosody of Text-to-Speech with Grapheme Predictions

Li, Yinghao Aaron; Han, Cong; Jiang, Xilin; Mesgarani, Nima

Jan-20-2023, arXiv.org Artificial Intelligence

Large-scale pre-trained language models have been shown to improve the naturalness of text-to-speech (TTS) models by enabling them to produce more naturalistic prosodic patterns. However, these models are usually word-level or sub-phoneme-level and jointly trained with phonemes, making them inefficient for the downstream TTS task where only phonemes are needed. In this work, we propose a phoneme-level BERT (PL-BERT) with a pretext task of predicting the corresponding graphemes along with the regular masked phoneme predictions. Subjective evaluations show that our phoneme-level BERT encoder significantly improves the mean opinion scores (MOS) of rated naturalness of synthesized speech compared with the state-of-the-art (SOTA) StyleTTS baseline on out-of-distribution (OOD) texts.
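As a rough illustration of the two pretext tasks named in the abstract, the sketch below builds one training example in which some phonemes are masked (masked phoneme prediction) and each phoneme position carries the grapheme token it came from (grapheme prediction). This is not the authors' implementation. The `build_example` helper, the `<mask>` symbol, the whole-word masking policy, the masking probability, and the choice to attach a grapheme target to every position are all assumptions made for illustration.

```python
import random

MASK = "<mask>"  # hypothetical mask symbol, not taken from the paper


def build_example(words, phonemes_per_word, mask_prob=0.15, rng=None):
    """Build one example for the two pretext tasks (illustrative only).

    words: grapheme tokens, e.g. ["the", "cat"]
    phonemes_per_word: one phoneme list per word, aligned with `words`
    Returns (inputs, phoneme_targets, grapheme_targets), all of equal length.
    """
    if rng is None:
        rng = random.Random()
    inputs, phoneme_targets, grapheme_targets = [], [], []
    for word, phonemes in zip(words, phonemes_per_word):
        # Assumption: mask all phonemes of a word together.
        mask_word = rng.random() < mask_prob
        for phoneme in phonemes:
            inputs.append(MASK if mask_word else phoneme)
            # Masked phoneme prediction: a target exists only where masked.
            phoneme_targets.append(phoneme if mask_word else None)
            # Grapheme prediction: each phoneme position predicts its word.
            grapheme_targets.append(word)
    return inputs, phoneme_targets, grapheme_targets


words = ["the", "cat"]
phonemes = [["ð", "ə"], ["k", "æ", "t"]]
inputs, ph_tgt, gr_tgt = build_example(
    words, phonemes, mask_prob=0.5, rng=random.Random(0)
)
```

In a real model, the two target sequences would feed two separate classification heads on top of a shared phoneme-level encoder, with their losses summed. How the losses are weighted is not stated in the abstract.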

artificial intelligence, machine learning, natural language

Country:
- North America > United States (0.28)

Genre:
- Research Report
  - New Finding (0.69)
  - Experimental Study (0.69)

Technology:
- Information Technology > Artificial Intelligence
  - Natural Language (1.00)
  - Machine Learning (1.00)
  - Speech > Speech Synthesis (0.74)
  - Vision > Optical Character Recognition (0.72)
