Empowering Global Voices: A Data-Efficient, Phoneme-Tone Adaptive Approach to High-Fidelity Speech Synthesis

Geng, Yizhong; Xu, Jizhuo; Liang, Zeyu; Yang, Jinghan; Shi, Xiaoyi; Shen, Xiaoyu

Apr-11-2025–arXiv.org Artificial Intelligence

Text-to-speech (TTS) technology has achieved impressive results for widely spoken languages, yet many under-resourced languages are still held back by limited data and linguistic complexity. In this paper, we present a novel methodology that integrates a data-optimized framework with an advanced acoustic model to build high-quality TTS systems for low-resource scenarios. We demonstrate the effectiveness of our approach using Thai as an illustrative case, addressing both its intricate phonetic rules and its sparse resources. Our method enables zero-shot voice cloning and improves performance across diverse client applications, including finance, healthcare, education, and law. Extensive subjective and objective evaluations confirm that our model meets state-of-the-art standards, offering a scalable solution for TTS production in data-limited settings, with significant implications for broader industry adoption and multilingual accessibility.

Country:
- Asia (0.46)

Genre:
- Research Report (0.50)

Industry:
- Education (0.93)
- Information Technology (0.88)
- Media (0.68)
- Law (0.68)

Technology:
- Information Technology > Artificial Intelligence
  - Speech > Speech Synthesis (0.73)
  - Natural Language > Large Language Model (0.68)
  - Machine Learning > Neural Networks (0.48)
