MixedG2P-T5: G2P-free Speech Synthesis for Mixed-script texts using Speech Self-Supervised Learning and Language Model

Park, Joonyong, Saito, Daisuke, Minematsu, Nobuaki

arXiv.org Artificial Intelligence 

--This study presents a novel approach to voice synthesis that can substitute the traditional grapheme-to-phoneme (G2P) conversion by using a deep learning-based model that generates discrete tokens directly from speech. Utilizing a pre-trained voice SSL model, we train a T5 encoder to produce pseudo-language labels from mixed-script texts (e.g., containing Kanji and Kana). This method eliminates the need for manual phonetic transcription, reducing costs and enhancing scalability, especially for large non-transcribed audio datasets. Our model matches the performance of conventional G2P-based text-to-speech systems and is capable of synthesizing speech that retains natural linguistic and paralinguistic features, such as accents and intonations. Speech synthesis refers to the technology by which machines automatically generate speech audio signals and is commonly known as text-to-speech (TTS).