Speech-to-Text Translation with Phoneme-Augmented CoT: Enhancing Cross-Lingual Transfer in Low-Resource Scenarios
Gállego, Gerard I., Pareras, Oriol, Garcia, Martí Cortada, Takanori, Lucas, Hernando, Javier
–arXiv.org Artificial Intelligence
We propose a Speech-to-Text Translation (S2TT) approach that integrates phoneme representations into a Chain-of-Thought (CoT) framework to improve translation in low-resource and zero-resource settings. By introducing phoneme recognition as an intermediate step, we enhance cross-lingual transfer, enabling translation even for languages with no labeled speech data. Our system builds on a multilingual LLM, which we extend to process speech and phonemes. Training follows a curriculum learning strategy that progressively introduces more complex tasks. Experiments on multilingual S2TT benchmarks show that phoneme-augmented CoT improves translation quality in low-resource conditions and enables zero-resource translation, while slightly impacting high-resource performance. Despite this trade-off, our findings demonstrate that phoneme-based CoT is a promising step toward making S2TT more accessible across diverse languages.
arXiv.org Artificial Intelligence
Sep-30-2025
- Country:
- Asia
- Middle East > UAE
- Abu Dhabi Emirate > Abu Dhabi (0.04)
- Singapore (0.04)
- Middle East > UAE
- Europe
- France > Provence-Alpes-Côte d'Azur
- Bouches-du-Rhône > Marseille (0.04)
- Spain > Catalonia
- Barcelona Province > Barcelona (0.04)
- France > Provence-Alpes-Côte d'Azur
- North America > United States
- Florida > Miami-Dade County
- Miami (0.04)
- Pennsylvania > Philadelphia County
- Philadelphia (0.04)
- Washington > King County
- Seattle (0.04)
- Florida > Miami-Dade County
- Asia
- Genre:
- Research Report > New Finding (1.00)
- Technology:
- Information Technology > Artificial Intelligence
- Machine Learning (1.00)
- Natural Language > Machine Translation (1.00)
- Speech > Speech Recognition (1.00)
- Information Technology > Artificial Intelligence