FELLE: Autoregressive Speech Synthesis with Token-Wise Coarse-to-Fine Flow Matching
Wang, Hui; Liu, Shujie; Meng, Lingwei; Li, Jinyu; Yang, Yifan; Zhao, Shiwan; Sun, Haiyang; Liu, Yanqing; Sun, Haoqin; Zhou, Jiaming; Lu, Yan; Qin, Yong
arXiv.org Artificial Intelligence
To advance continuous-valued token modeling and enforce temporal coherence, we propose FELLE, an autoregressive model that integrates language modeling with token-wise flow matching. By combining the autoregressive nature of language models with the generative efficacy of flow matching, FELLE predicts continuous-valued tokens (mel-spectrograms). For each continuous-valued token, FELLE modifies the general prior distribution in flow matching by incorporating information from the previous step, which improves coherence and stability. To further enhance synthesis quality, FELLE introduces a coarse-to-fine flow-matching mechanism that generates continuous-valued tokens hierarchically, conditioned on the language model's output. Experimental results demonstrate the potential of incorporating flow-matching techniques in autoregressive mel-spectrogram modeling, leading to significant improvements in TTS generation quality, as shown at https://aka.ms/felle.
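The idea of a flow-matching prior informed by the previous step can be illustrated with a small NumPy sketch. This is a toy under stated assumptions, not FELLE's implementation: the abstract does not specify the form of the prior or the probability path, so the Gaussian centered on the previous frame, the `sigma` value, the straight-line path, and every function name here (`sample_prior`, `fm_training_pair`, `euler_sample`, `oracle_velocity`) are hypothetical. A closed-form oracle stands in for the learned, language-model-conditioned velocity network, and the coarse-to-fine hierarchy is omitted.

```python
import numpy as np


def sample_prior(prev_frame, sigma, rng):
    # Assumption: the prior is a Gaussian centered on the previous frame,
    # instead of the standard N(0, I) prior used in generic flow matching.
    return prev_frame + sigma * rng.standard_normal(prev_frame.shape)


def fm_training_pair(x0, x1, t):
    # Assumption: straight-line probability path between prior sample x0
    # and target frame x1. A network would regress target_velocity at x_t.
    x_t = (1.0 - t) * x0 + t * x1
    target_velocity = x1 - x0
    return x_t, target_velocity


def euler_sample(velocity_fn, x0, num_steps):
    # Integrate dx/dt = v(x, t) from t=0 to t=1 with fixed-step Euler.
    x = x0.copy()
    dt = 1.0 / num_steps
    for k in range(num_steps):
        t = k * dt
        x = x + dt * velocity_fn(x, t)
    return x


rng = np.random.default_rng(0)
n_mels = 80  # hypothetical mel dimension
prev_frame = rng.standard_normal(n_mels)
next_frame = prev_frame + 0.1 * rng.standard_normal(n_mels)

# Training-side quantities for one token.
x0 = sample_prior(prev_frame, sigma=0.5, rng=rng)
x_t, v = fm_training_pair(x0, next_frame, t=0.3)


def oracle_velocity(x, t):
    # Stand-in for the learned network: points from x toward the known target.
    return (next_frame - x) / (1.0 - t)


# Sampling-side: start from the previous-frame prior and integrate.
generated = euler_sample(oracle_velocity, x0, num_steps=8)
```

Because the prior sample already sits near the previous frame, the flow only has to cover the short distance to the next frame, which is one intuition for why such a prior could help coherence between consecutive tokens.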
Feb-16-2025