condeepmod
Speech Separation based on Contrastive Learning and Deep Modularization
Previous work has explored whether general audio pre-trained models can boost speech separation, with the main finding being that they provide minimal benefit compared with features extracted without such models. It has been hypothesised that, because these models were trained on clean audio datasets, they fail to generalize to noisy and mixed speech and are therefore ineffective for speech separation. This paper investigates this hypothesis by comparing the performance of a pre-trained model trained on contaminated speech with that of one trained on clean speech. We evaluate whether contamination leads to better downstream performance. We also investigate whether the type of input used to train the pre-trained model affects the quality of the embeddings it generates. To separate the sources, we propose a fully unsupervised speech separation technique based on deep modularization. Our findings establish that, when noise and reverberation are injected into the training dataset, the pre-trained model generates significantly better embeddings than when a clean dataset is used. Further, for the model presented here, working in the short-time Fourier transform (STFT) domain results in better features than working in the time domain. The proposed deep modularization technique improves SI-SNRi and SDRi by 1.3 dB and 2.7 dB, respectively, when mixtures contain fewer than four sources, and improves the results significantly for mixtures with many sources.
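For readers unfamiliar with the SI-SNRi metric reported above, the following is a minimal sketch of how scale-invariant signal-to-noise ratio and its improvement over the unprocessed mixture are commonly computed. It follows the standard definition of the metric and is not taken from this paper's evaluation code; the function names `si_snr` and `si_snr_improvement`, the `eps` stabiliser, and the synthetic signals are assumptions made for illustration.

```python
import numpy as np


def si_snr(estimate, reference, eps=1e-8):
    """Scale-invariant SNR in dB between a 1-D estimate and its reference."""
    estimate = np.asarray(estimate, dtype=np.float64)
    reference = np.asarray(reference, dtype=np.float64)
    # Remove the mean so that DC offsets do not affect the measure.
    estimate = estimate - estimate.mean()
    reference = reference - reference.mean()
    # Project the estimate onto the reference to obtain the target component.
    scale = np.dot(estimate, reference) / (np.dot(reference, reference) + eps)
    target = scale * reference
    # Everything not explained by the scaled reference counts as noise.
    noise = estimate - target
    ratio = (np.dot(target, target) + eps) / (np.dot(noise, noise) + eps)
    return 10.0 * np.log10(ratio)


def si_snr_improvement(estimate, mixture, reference):
    """SI-SNRi: gain of the separated estimate over the unprocessed mixture."""
    return si_snr(estimate, reference) - si_snr(mixture, reference)


# Illustrative synthetic signals (assumption: white noise stands in for speech).
rng = np.random.default_rng(0)
source_a = rng.standard_normal(16000)
source_b = rng.standard_normal(16000)
mixture = source_a + source_b
# A hypothetical separator output that still leaks 10% of the other source.
estimate_a = source_a + 0.1 * source_b

print(f"SI-SNRi: {si_snr_improvement(estimate_a, mixture, source_a):.2f} dB")
```

Because the estimate is projected onto the reference before the ratio is taken, rescaling the estimate leaves the score essentially unchanged, which is why the metric suits separation systems whose output gain is arbitrary.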