Quantitative Evidence on Overlooked Aspects of Enrollment Speaker Embeddings for Target Speaker Separation
Liu, Xiaoyu, Li, Xu, Serrà, Joan
arXiv.org Artificial Intelligence
Single-channel target speaker separation (TSS) aims at extracting a speaker's voice from a mixture of multiple talkers given an enrollment utterance of that speaker. A typical deep learning TSS framework consists of an upstream model that obtains enrollment speaker embeddings and a downstream model that performs the separation conditioned on the embeddings. In this paper, we look into several important but overlooked aspects of the enrollment embeddings, including the suitability of the widely used speaker identification embeddings, the introduction of the log-mel filterbank and self-supervised embeddings, and the embeddings' cross-dataset generalization capability. Our results show that the speaker identification ...

FBANK, as a simple signal processing method, has been ignored as an enrollment option in previous literature. SSL are a class of powerful models that learn problem-agnostic speech features from unlabelled data [12-14], and we hypothesize that such broader information (compared to SID) could benefit TSS enrollment. Note that, unlike [15], which uses SSL as the input mixture features for blind speaker separation, we limit SSL to offline processing of the enrollment utterance, since TSS often requires real-time low-complexity processing for the mixtures [2-5]. Finally, we consider a cross-dataset evaluation to assess the generalization of the enrollment embeddings [16], which is another important but overlooked aspect in previous TSS research.
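To make the FBANK enrollment option concrete, the following is a minimal NumPy sketch of frame-level log-mel filterbank features computed from an enrollment waveform. It is an illustration under stated assumptions, not the configuration used in the paper: the sampling rate (16 kHz), window and hop sizes (25 ms and 10 ms), FFT size (512), number of mel bands (80), and the function names `mel_filterbank` and `log_mel_fbank` are all hypothetical choices. How such features are pooled or passed to the downstream separation model is not described in this excerpt and is left out.

```python
import numpy as np

def hz_to_mel(f):
    # HTK-style mel scale (an assumed, commonly used variant).
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_mels, n_fft, sample_rate):
    """Triangular mel filters, shape (n_mels, n_fft // 2 + 1)."""
    freqs = np.linspace(0.0, sample_rate / 2.0, n_fft // 2 + 1)
    # n_mels + 2 band edges, equally spaced on the mel scale.
    edges = mel_to_hz(np.linspace(0.0, hz_to_mel(sample_rate / 2.0), n_mels + 2))
    fb = np.zeros((n_mels, freqs.size))
    for i in range(n_mels):
        lo, mid, hi = edges[i], edges[i + 1], edges[i + 2]
        rising = (freqs - lo) / (mid - lo)
        falling = (hi - freqs) / (hi - mid)
        fb[i] = np.clip(np.minimum(rising, falling), 0.0, None)
    return fb

def log_mel_fbank(wave, sample_rate=16000, n_fft=512, win=400, hop=160, n_mels=80):
    """Frame-level log-mel features, shape (n_frames, n_mels).

    All default values are illustrative assumptions.
    """
    n_frames = 1 + (len(wave) - win) // hop
    # Index matrix that slices the waveform into overlapping frames.
    idx = np.arange(win)[None, :] + hop * np.arange(n_frames)[:, None]
    frames = wave[idx] * np.hanning(win)
    power = np.abs(np.fft.rfft(frames, n=n_fft, axis=1)) ** 2
    mel = power @ mel_filterbank(n_mels, n_fft, sample_rate).T
    return np.log(mel + 1e-6)  # small floor avoids log(0)

# Synthetic 1-second "enrollment utterance" (a 440 Hz tone stands in for speech).
t = np.arange(16000) / 16000.0
feats = log_mel_fbank(np.sin(2 * np.pi * 440.0 * t))
print(feats.shape)  # (98, 80)
```

Unlike SID or SSL embeddings, this computation involves no trained model, which is what makes FBANK a simple signal processing baseline for enrollment.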
Oct-26-2022