
Collaborating Authors

 Krishnaswamy, Arvindh


PoCoNet: Better Speech Enhancement with Frequency-Positional Embeddings, Semi-Supervised Conversational Data, and Biased Loss

arXiv.org Machine Learning

Neural network applications generally benefit from larger-sized models, but for current speech enhancement models, larger scale networks often suffer from decreased robustness to the variety of real-world use cases beyond what is encountered in training data. We introduce several innovations that lead to better large neural networks for speech enhancement. The novel PoCoNet architecture is a convolutional neural network that, with the use of frequency-positional embeddings, is able to more efficiently build frequency-dependent features in the early layers. A semi-supervised method helps increase the amount of conversational training data by pre-enhancing noisy datasets, improving performance on real recordings.

These tend to have very large weight matrices in the early layers, where the architecture could benefit from a more hierarchical development of features. On the other hand, in standard 2D U-Net models where kernels move in both the time and frequency directions [13], early layer activations are blind to what frequency they operate in - even in the case when padding is used, these early features' receptive fields have not yet reached the edges of the time-frequency image. Our proposed architecture has the advantages of both options: it is a 2D U-Net (with DenseNet blocks and self-attention) with small kernels, and can therefore develop features hierarchically, but can also take into account frequency information in early layers.
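
To make the frequency-positional embedding idea concrete, the sketch below concatenates deterministic channels that encode each bin's position along the frequency axis to a spectrogram input, so that even small early-layer kernels know which frequencies they are processing. This is only an illustration, not the paper's implementation: the sinusoidal encoding, the helper name add_frequency_positional_embedding, and the channel count are assumptions.

```python
import numpy as np

def add_frequency_positional_embedding(spec, num_embedding_channels=4):
    """Concatenate frequency-positional channels to a spectrogram.

    spec: array of shape (channels, freq_bins, time_frames).
    Returns an array with `num_embedding_channels` extra channels that
    encode each bin's position along the frequency axis.
    (Illustrative sketch; the encoding used in the paper may differ.)
    """
    _, freq_bins, time_frames = spec.shape
    # Normalized frequency position in [0, 1] for every bin.
    positions = np.linspace(0.0, 1.0, freq_bins)[None, :, None]      # (1, F, 1)
    embeddings = []
    for k in range(num_embedding_channels // 2):
        embeddings.append(np.sin(2 ** k * np.pi * positions))
        embeddings.append(np.cos(2 ** k * np.pi * positions))
    # Broadcast each embedding channel across all time frames.
    emb = np.concatenate(
        [np.repeat(e, time_frames, axis=2) for e in embeddings], axis=0
    )                                                                 # (E, F, T)
    return np.concatenate([spec, emb], axis=0)                        # (C+E, F, T)

# Example: a 2-channel (real/imaginary) spectrogram with 257 bins, 100 frames.
noisy_spec = np.random.randn(2, 257, 100).astype(np.float32)
augmented = add_frequency_positional_embedding(noisy_spec)
print(augmented.shape)  # (6, 257, 100)
```

Because the added channels depend only on the frequency index, the network stays fully convolutional in time while becoming frequency-aware from the very first layer.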


Channel-Attention Dense U-Net for Multichannel Speech Enhancement

arXiv.org Machine Learning

Traditionally, beamforming techniques have been employed, where a linear spatial filter is estimated, per frequency, to boost the signal from the desired target direction while attenuating interferences from other directions by utilizing second-order statistics, e.g., the spatial covariance of speech and noise [1]. In recent years, deep learning (DL) based supervised speech enhancement techniques have achieved significant success [2], specifically in the monaural/single-channel case. Motivated by this success, a recent line of work proposes to combine supervised single-channel techniques with unsupervised beamforming methods for the multichannel case [3, 4]. These approaches are broadly known as neural beamforming, where a neural network estimates the second-order statistics of speech and noise using estimated time-frequency (TF) masks, after which the beamformer is applied to linearly combine the multichannel mixture and produce clean speech. However, the performance of neural beamforming is limited by the nature of beamforming: a linear spatial filter per frequency bin. Another line of work [5, 6] proposes to use spatial features along with spectral information to estimate TF masks. Most of these approaches have an explicit step to extract spatial features such as the interchannel time/phase/level difference (ITD/IPD/ILD). (This work was done while B. Tolooshams and A. H. Song were interns at Amazon Web Services.)
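
The neural-beamforming pipeline the abstract contrasts itself with (network-estimated TF masks driving second-order statistics, followed by a linear per-frequency spatial filter) can be illustrated with a small mask-based MVDR sketch. The function name, the eigenvector-based steering estimate, and the diagonal regularization below are assumptions made for illustration, not the paper's proposed method.

```python
import numpy as np

def mask_based_mvdr(mixture_stft, speech_mask, noise_mask):
    """Illustrative mask-driven MVDR beamformer.

    mixture_stft: complex array of shape (mics, freq_bins, time_frames).
    speech_mask, noise_mask: real arrays of shape (freq_bins, time_frames),
        e.g. produced by a neural network.
    Returns a single-channel enhanced STFT of shape (freq_bins, time_frames).
    """
    M, F, T = mixture_stft.shape
    enhanced = np.zeros((F, T), dtype=complex)
    for f in range(F):
        X = mixture_stft[:, f, :]                                     # (M, T)
        # Mask-weighted spatial covariance matrices of speech and noise.
        phi_s = (speech_mask[f] * X) @ X.conj().T / max(speech_mask[f].sum(), 1e-8)
        phi_n = (noise_mask[f] * X) @ X.conj().T / max(noise_mask[f].sum(), 1e-8)
        phi_n += 1e-6 * np.eye(M)                                     # regularize inversion
        # Steering vector from the principal eigenvector of the speech covariance.
        _, eigvecs = np.linalg.eigh(phi_s)
        d = eigvecs[:, -1]
        # MVDR weights: w = (Phi_n^-1 d) / (d^H Phi_n^-1 d).
        num = np.linalg.solve(phi_n, d)
        w = num / (d.conj() @ num)
        enhanced[f] = w.conj() @ X                                    # linear per-frequency filter
    return enhanced

# Example usage with random data: 4 mics, 257 frequency bins, 100 frames.
rng = np.random.default_rng(0)
mix = rng.standard_normal((4, 257, 100)) + 1j * rng.standard_normal((4, 257, 100))
mask = rng.uniform(size=(257, 100))
out = mask_based_mvdr(mix, speech_mask=mask, noise_mask=1.0 - mask)
print(out.shape)  # (257, 100)
```

Each frequency bin is processed independently with a single complex weight vector, which is exactly the per-frequency linear restriction the abstract identifies as the limiting factor of neural beamforming.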