Streaming Non-Autoregressive Model for Accent Conversion and Pronunciation Improvement

Tuan-Nam Nguyen, Ngoc-Quan Pham, Seymanur Akti, Alexander Waibel

arXiv.org Artificial Intelligence 

We propose the first streaming accent conversion (AC) model, which transforms non-native speech into a native-like accent while preserving speaker identity and prosody and improving pronunciation. Our approach enables streaming by modifying a previous AC architecture with an Emformer encoder and an optimized inference mechanism. Additionally, we integrate a native text-to-speech (TTS) model to generate ideal ground-truth data for efficient training. Our streaming AC model achieves performance comparable to that of the top AC models while maintaining stable latency, making it the first AC system capable of streaming.
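
As a rough illustration of why a segment-based streaming encoder can keep latency stable, the sketch below splits an utterance into fixed-size segments with a short lookahead, which is the general pattern an Emformer-style encoder consumes. It is a minimal sketch and not the authors' implementation: the function names, the segment length, the right-context size, and the frame shift are all hypothetical values chosen for the example.

```python
from typing import Iterator, Sequence, Tuple


def stream_segments(
    frames: Sequence, segment_length: int, right_context: int
) -> Iterator[Tuple[int, int, int]]:
    """Yield (start, end, lookahead_end) frame indices for each segment.

    Frames in [start, end) are the segment to convert; frames in
    [end, lookahead_end) are the extra lookahead the encoder may peek at.
    All sizes are hypothetical and not taken from the paper.
    """
    if segment_length <= 0 or right_context < 0:
        raise ValueError("segment_length must be > 0 and right_context >= 0")
    total = len(frames)
    for start in range(0, total, segment_length):
        end = min(start + segment_length, total)
        lookahead_end = min(end + right_context, total)
        yield start, end, lookahead_end


def algorithmic_latency_ms(
    segment_length: int, right_context: int, frame_shift_ms: float
) -> float:
    """Upper bound on buffering delay: one full segment plus its lookahead."""
    return (segment_length + right_context) * frame_shift_ms


if __name__ == "__main__":
    utterance = list(range(10))  # stand-in for 10 acoustic frames
    for start, end, lookahead_end in stream_segments(utterance, 4, 1):
        print(f"segment [{start}, {end}) with lookahead up to {lookahead_end}")
    print(algorithmic_latency_ms(4, 1, 10.0), "ms")
```

Because every segment has the same maximum size, the buffering delay is bounded by the segment length plus the lookahead, regardless of how long the utterance is. Model compute time comes on top of this bound and is not modeled here.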