You Sound a Little Tense: L2 Tailored Clear TTS Using Durational Vowel Properties
Tuttösí, Paige, Yeung, H. Henny, Wang, Yue, Aucouturier, Jean-Julien, Lim, Angelica
–arXiv.org Artificial Intelligence
We present the first text-to-speech (TTS) system tailored to second language (L2) speakers. We use duration differences between American English tense (longer) and lax (shorter) vowels to create a "clarity mode" for Matcha-TTS. Our perception studies showed that French-L1, English-L2 listeners made fewer transcription errors (at least 9.15% fewer) when using our clarity mode, and found it more encouraging and respectful than overall slowed-down speech. Remarkably, listeners were not aware of these effects: despite the decreased word error rate in clarity mode, listeners still believed that slowing all target words was the most intelligible, suggesting that actual intelligibility does not correlate with perceived intelligibility. Additionally, we found that Whisper-ASR did not use the same cues as L2 speakers to differentiate difficult vowels and is not sufficient to assess the intelligibility of TTS systems for these individuals.
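The abstract reports results in terms of transcription errors and word error rate (WER). As a minimal sketch of the standard WER definition (word-level edit distance divided by the number of reference words), the Python below may help make the metric concrete. The function name `word_error_rate` and the example sentences are hypothetical illustrations, and the abstract does not state how the authors normalized or scored transcriptions, so this should not be read as their scoring procedure.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Standard WER: word-level Levenshtein distance / reference length.

    Assumption: text is lowercased and split on whitespace. The study's
    actual normalization is not described in the abstract.
    """
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1      # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


# Hypothetical example: a listener hears the lax vowel in "ship"
# where the tense vowel in "sheep" was intended.
wer = word_error_rate("the sheep is on the ship", "the ship is on the ship")
print(f"WER = {wer:.3f}")
```

In this hypothetical example a single tense/lax confusion in a six-word sentence counts as one substitution, so the score is one error out of six reference words.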
Sep-4-2025
- Country:
- Asia > Taiwan (0.04)
- Europe > France (0.04)
- North America
- Canada > British Columbia
- Metro Vancouver Regional District > Burnaby (0.04)
- United States (0.04)
- Genre:
- Research Report > New Finding (0.48)
- Technology:
- Information Technology > Artificial Intelligence
- Machine Learning > Neural Networks (0.46)
- Natural Language (1.00)
- Speech > Speech Synthesis (0.35)