The Effects of Input Type and Pronunciation Dictionary Usage in Transfer Learning for Low-Resource Text-to-Speech
Do, Phat, Coler, Matt, Dijkstra, Jelske, Klabbers, Esther
arXiv.org Artificial Intelligence
We compare phone labels and articulatory features as input for cross-lingual transfer learning in text-to-speech (TTS) for low-resource languages (LRLs). Experiments with FastSpeech 2 and the LRL West Frisian show that articulatory features outperform phone labels in both intelligibility and naturalness. For LRLs without pronunciation dictionaries, we propose two novel approaches: a) using a massively multilingual model for grapheme-to-phone (G2P) conversion in both training and synthesis, and b) using a universal phone recognizer to create a makeshift dictionary. Results show that the G2P approach performs largely on par with using a ground-truth dictionary, and that the phone recognition approach, while generally performing worse, remains a viable option for LRLs less suited to the G2P approach. Within each approach, using articulatory features as input outperforms using phone labels.
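The contrast between the two input types can be illustrated with a minimal sketch. It is not the authors' implementation: the feature names, the binary values, and the tiny phone inventory below are hypothetical and hand-written for illustration only. The sketch assumes a source-language inventory that lacks the phone "z". A phone-label encoder then has no ID for it, whereas an articulatory-feature encoder can still represent it with dimensions shared by phones seen in the source language.

```python
# Hypothetical articulatory dimensions; real systems use richer feature sets.
FEATURES = ("voiced", "nasal", "labial", "coronal", "plosive", "fricative")

# Hypothetical, hand-written feature vectors for a toy phone inventory.
ARTICULATORY = {
    "p": (0, 0, 1, 0, 1, 0),
    "b": (1, 0, 1, 0, 1, 0),
    "m": (1, 1, 1, 0, 0, 0),
    "t": (0, 0, 0, 1, 1, 0),
    "d": (1, 0, 0, 1, 1, 0),
    "s": (0, 0, 0, 1, 0, 1),
    "z": (1, 0, 0, 1, 0, 1),
}

# Assumption: the source (high-resource) language never uses "z".
SOURCE_PHONES = ["p", "b", "m", "t", "d", "s"]
PHONE_IDS = {phone: i for i, phone in enumerate(sorted(SOURCE_PHONES))}


def encode_as_labels(phones):
    """Map each phone to its integer ID; raises KeyError for unseen phones."""
    return [PHONE_IDS[p] for p in phones]


def encode_as_features(phones):
    """Map each phone to its articulatory feature vector."""
    return [ARTICULATORY[p] for p in phones]


def hamming(a, b):
    """Count the feature dimensions on which two phones differ."""
    return sum(x != y for x, y in zip(a, b))


if __name__ == "__main__":
    print(encode_as_labels(["b", "p"]))
    print(encode_as_features(["z"]))
    # "z" differs from the seen phone "s" in voicing only.
    print(hamming(ARTICULATORY["s"], ARTICULATORY["z"]))
```

In this toy setup, an unseen target-language phone stays close to seen phones in feature space, which is one plausible reason feature input could transfer better across languages. The abstract itself reports only the empirical outcome.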
Jun-1-2023
- Country:
  - Europe
    - France > Provence-Alpes-Côte d'Azur
      - Bouches-du-Rhône > Marseille (0.04)
    - Ireland > Leinster
      - County Dublin > Dublin (0.05)
    - Netherlands (0.05)
    - Slovenia (0.04)
  - North America
    - Canada > Quebec
      - Montreal (0.04)
    - United States
      - California > San Diego County
        - San Diego (0.04)
      - New York > New York County
        - New York City (0.04)
- Genre:
  - Research Report > New Finding (0.34)
- Technology:
  - Information Technology > Artificial Intelligence
    - Machine Learning > Transfer Learning (0.63)
    - Natural Language (1.00)
    - Speech > Speech Synthesis (0.72)
    - Vision > Optical Character Recognition (0.62)