FELLE: Autoregressive Speech Synthesis with Token-Wise Coarse-to-Fine Flow Matching

Wang, Hui, Liu, Shujie, Meng, Lingwei, Li, Jinyu, Yang, Yifan, Zhao, Shiwan, Sun, Haiyang, Liu, Yanqing, Sun, Haoqin, Zhou, Jiaming, Lu, Yan, Qin, Yong

Feb-16-2025–arXiv.org Artificial Intelligence

To advance continuous-valued token modeling and temporal-coherence enforcement, we propose FELLE, an autoregressive model that integrates language modeling with token-wise flow matching. By leveraging the autoregressive nature of language models and the generative efficacy of flow matching, FELLE effectively predicts continuous-valued tokens (mel-spectrograms). For each continuous-valued token, FELLE modifies the general prior distribution in flow matching by incorporating information from the previous step, improving coherence and stability. Furthermore, to enhance synthesis quality, FELLE introduces a coarse-to-fine flow-matching mechanism, generating continuous-valued tokens hierarchically, conditioned on the language model's output. Experimental results demonstrate the potential of incorporating flow-matching techniques in autoregressive mel-spectrogram modeling, leading to significant improvements in TTS generation quality, as shown in https://aka.ms/felle.

arxiv preprint arxiv, large language model, machine learning, (15 more...)

arXiv.org Artificial Intelligence

Feb-16-2025

arXiv.org PDF

Add feedback

Genre:
- Research Report > New Finding (0.48)

Technology:
- Information Technology > Artificial Intelligence
  - Machine Learning > Neural Networks (0.68)
  - Natural Language > Large Language Model (0.48)
  - Speech > Speech Synthesis (0.69)

Duplicate Docs Excel Report

Title
None found

Similar Docs Excel Report more

Title	Similarity	Source
None found