LauraTSE: Target Speaker Extraction using Auto-Regressive Decoder-Only Language Models

Beilong Tang, Bang Zeng, Ming Li

arXiv.org Artificial Intelligence 

We propose LauraTSE, an auto-regressive decoder-only language model for Target Speaker Extraction built upon the LauraGPT backbone. LauraTSE employs a small-scale auto-regressive decoder-only language model that generates the initial layers of the target speech's discrete codec representation from the continuous embeddings of both the mixture and the reference speech. These outputs serve as coarse-grained predictions. To refine them, a one-step encoder-only language model reconstructs the full codec representation by integrating information from both the mixture and the reference speech, adding fine-grained details. Experimental results show that our approach achieves promising performance. Additionally, we conduct ablation studies to investigate data scalability and the contribution of the encoder-only model.

Target Speaker Extraction (TSE) aims to extract the target speaker's speech from a mixture using auxiliary information about that speaker, such as reference speech, spatial cues, or visual cues [1]. Current dominant approaches use discriminative models, which attempt to map the mixture speech directly to the target clean speech [2]-[5]. However, these methods may struggle on unseen data and can even introduce undesirable distortions [6].
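The two-stage design described above (coarse auto-regressive token generation, then one-step refinement to the full codec stack) can be sketched as follows. This is a minimal illustrative mock-up, not the authors' implementation: the function names, layer counts, codebook size, and the toy token rules standing in for the language models are all assumptions.

```python
# Hypothetical sketch of a LauraTSE-style two-stage pipeline.
# All constants and function names below are illustrative assumptions.

NUM_CODEC_LAYERS = 8   # total residual codec layers (assumed)
COARSE_LAYERS = 1      # initial layers predicted auto-regressively (assumed)
CODEBOOK_SIZE = 1024   # tokens per codebook (assumed)
NUM_FRAMES = 5         # toy utterance length in codec frames

def ar_decoder_step(mix_emb, ref_emb, prev_frames):
    """Stub for the auto-regressive decoder-only LM: emits one frame's
    coarse codec tokens, conditioned on the mixture/reference embeddings
    and the frames generated so far."""
    # Deterministic toy rule standing in for a real LM forward pass.
    seed = len(prev_frames) + int(sum(mix_emb)) + int(sum(ref_emb))
    return [(seed * 31 + layer) % CODEBOOK_SIZE
            for layer in range(COARSE_LAYERS)]

def encoder_refine(coarse_frames, mix_emb, ref_emb):
    """Stub for the one-step encoder-only LM: expands the coarse tokens
    to the full codec representation in a single non-autoregressive pass,
    adding the fine-grained layers."""
    full = []
    for t, frame in enumerate(coarse_frames):
        fine = [(frame[0] + layer * 7 + t) % CODEBOOK_SIZE
                for layer in range(COARSE_LAYERS, NUM_CODEC_LAYERS)]
        full.append(frame + fine)
    return full

def extract_target(mix_emb, ref_emb):
    # Stage 1: coarse-grained auto-regressive generation, frame by frame.
    coarse = []
    for _ in range(NUM_FRAMES):
        coarse.append(ar_decoder_step(mix_emb, ref_emb, coarse))
    # Stage 2: one-step refinement to the full codec stack.
    return encoder_refine(coarse, mix_emb, ref_emb)

tokens = extract_target(mix_emb=[0.2, 0.8], ref_emb=[0.5, 0.5])
print(len(tokens), len(tokens[0]))  # frames x codec layers
```

In a real system, the coarse tokens would be decoded by a neural codec vocoder; the sketch only shows how the auto-regressive stage supplies the initial layers and the encoder-only stage completes the remaining ones in one pass.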
