On permutation invariant training for speech source separation

Liu, Xiaoyu, Pons, Jordi

arXiv.org Artificial Intelligence 

We study permutation invariant training (PIT), which targets the permutation ambiguity problem for speaker-independent source separation models. We extend two state-of-the-art PIT strategies. First, we look at the two-stage speaker separation and tracking algorithm based on frame-level PIT (tPIT) and clustering, which was originally proposed for the STFT domain, and we adapt it to work with waveforms and over a learned latent space. Further, we propose an efficient clustering loss scalable to waveform models.

... Deep CASA, a spectrogram-based model, to Conv-TasNet, which uses very short waveform frames (such as 2 ms). We find that tPIT based on such short waveform frames can be challenging. Therefore, we propose performing tPIT in a pre-trained latent space, which allows for a more meaningful feature space for tPIT than the short waveform frames. Further, when training the clustering model, Deep CASA employs a memory-intensive and computationally expensive pairwise similarity loss that does not scale for waveform inputs. We propose a loss that reduces the complexity from quadratic to linear, making ...
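To make the distinction between utterance-level and frame-level PIT concrete, the following is a minimal sketch in plain Python, not the authors' implementation. It assumes a squared-error cost and toy one-dimensional "frames"; the function names (`frame_cost`, `assignment_cost`, `upit`, `tpit`) are hypothetical and chosen only for illustration. Utterance-level PIT picks a single source permutation for the whole utterance, whereas tPIT picks the best permutation independently at every frame.

```python
from itertools import permutations


def frame_cost(est_frame, tgt_frame):
    """Squared error between one estimated frame and one target frame."""
    return sum((e - t) ** 2 for e, t in zip(est_frame, tgt_frame))


def assignment_cost(estimates, targets, perm, t):
    """Cost at frame t when estimate perm[s] is assigned to target s."""
    return sum(
        frame_cost(estimates[perm[s]][t], targets[s][t])
        for s in range(len(targets))
    )


def upit(estimates, targets):
    """Utterance-level PIT: one permutation shared by all frames."""
    n_frames = len(targets[0])
    perms = list(permutations(range(len(targets))))
    costs = [
        sum(assignment_cost(estimates, targets, p, t) for t in range(n_frames))
        for p in perms
    ]
    best = min(range(len(perms)), key=costs.__getitem__)
    return costs[best], perms[best]


def tpit(estimates, targets):
    """Frame-level PIT: the best permutation is chosen per frame."""
    n_frames = len(targets[0])
    perms = list(permutations(range(len(targets))))
    total, chosen = 0.0, []
    for t in range(n_frames):
        best = min(perms, key=lambda p: assignment_cost(estimates, targets, p, t))
        total += assignment_cost(estimates, targets, best, t)
        chosen.append(best)
    return total, chosen


# Toy data (an assumption): 2 sources, 3 frames, 1 feature per frame.
# The two estimated outputs swap speakers in the middle frame.
targets = [[[1.0], [1.0], [1.0]], [[0.0], [0.0], [0.0]]]
estimates = [[[1.0], [0.0], [1.0]], [[0.0], [1.0], [0.0]]]

upit_loss, upit_perm = upit(estimates, targets)
tpit_loss, tpit_perms = tpit(estimates, targets)
```

In this toy case the frame-level search absorbs the mid-utterance swap, but the selected permutation changes from frame to frame, so the output streams are not consistently tied to one speaker. That is why the two-stage approach described above pairs tPIT with a tracking (clustering) stage.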
