Channel-Attention Dense U-Net for Multichannel Speech Enhancement
Bahareh Tolooshams, Ritwik Giri, Andrew H. Song, Umut Isik, Arvindh Krishnaswamy
Traditionally, beamforming techniques have been employed, in which a linear spatial filter is estimated per frequency to boost the signal from the desired target direction while attenuating interference from other directions, using second-order statistics such as the spatial covariance of speech and noise [1]. In recent years, deep learning (DL) based supervised speech enhancement techniques have achieved significant success [2], particularly in the monaural/single-channel case. Motivated by this success, a recent line of work proposes to combine supervised single-channel techniques with unsupervised beamforming methods for the multichannel case [3, 4]. These approaches are broadly known as neural beamforming: a neural network estimates the second-order statistics of speech and noise from estimated time-frequency (TF) masks, after which a beamformer linearly combines the multichannel mixture to produce clean speech. However, the performance of neural beamforming is limited by the nature of beamforming itself, a linear spatial filter per frequency bin. Another line of work [5, 6] proposes to use spatial features alongside spectral information to estimate TF masks; most of these approaches include an explicit step to extract spatial features such as interchannel time, phase, and level differences (ITD/IPD/ILD).

This work was done while B. Tolooshams and A. H. Song were interns at Amazon Web Services.
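To make the neural beamforming pipeline concrete, here is a minimal NumPy sketch of mask-based MVDR beamforming: mask-weighted spatial covariance matrices are accumulated per frequency bin, a steering vector is taken as the principal eigenvector of the speech covariance, and the MVDR filter linearly combines the channels. The function name, tensor shapes, eigenvector-based steering estimate, and regularization constants are illustrative assumptions, not details from [3, 4].

```python
import numpy as np

def mask_based_mvdr(X, speech_mask, noise_mask):
    """Sketch of mask-based MVDR beamforming.

    X           : complex STFT of the mixture, shape (C, F, T)
                  (channels, frequency bins, time frames) -- assumed layout
    speech_mask : real-valued TF mask in [0, 1], shape (F, T)
    noise_mask  : real-valued TF mask in [0, 1], shape (F, T)
    Returns the enhanced single-channel STFT, shape (F, T).
    """
    C, F, T = X.shape
    Y = np.zeros((F, T), dtype=complex)
    for f in range(F):
        Xf = X[:, f, :]  # all channels at this frequency bin, shape (C, T)

        # Mask-weighted spatial covariance matrices (C x C):
        # Phi = sum_t m(t) x(t) x(t)^H / sum_t m(t)
        Phi_s = (speech_mask[f] * Xf) @ Xf.conj().T / max(speech_mask[f].sum(), 1e-8)
        Phi_n = (noise_mask[f] * Xf) @ Xf.conj().T / max(noise_mask[f].sum(), 1e-8)
        Phi_n += 1e-6 * np.eye(C)  # diagonal loading for numerical stability (assumed)

        # Steering vector: principal eigenvector of the speech covariance
        # (one common choice; eigh returns eigenvalues in ascending order)
        _, eigvecs = np.linalg.eigh(Phi_s)
        d = eigvecs[:, -1]

        # MVDR weights: w = Phi_n^{-1} d / (d^H Phi_n^{-1} d)
        num = np.linalg.solve(Phi_n, d)
        w = num / (d.conj() @ num)

        # One linear spatial filter per frequency bin: y(t) = w^H x(t)
        Y[f] = w.conj() @ Xf
    return Y
```

Note that each frequency bin receives a single linear filter `w`, which is exactly the structural constraint the paragraph above identifies as the performance ceiling of neural beamforming.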
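Similarly, the explicit spatial-feature extraction step used in [5, 6] typically reduces to simple per-bin interchannel comparisons against a reference microphone. A hedged sketch follows; the function name and the dB convention for ILD are assumptions, not the cited papers' exact features.

```python
import numpy as np

def spatial_features(X, ref_ch=0):
    """Compute interchannel phase and level differences (IPD/ILD)
    of each channel relative to a reference channel.

    X : complex STFT of the mixture, shape (C, F, T)
    Returns IPD (radians) and ILD (dB), each of shape (C, F, T).
    """
    ref = X[ref_ch]  # reference-channel STFT, shape (F, T)
    # IPD: phase of the cross-term between each channel and the reference
    ipd = np.angle(X * ref.conj())
    # ILD: per-bin power ratio in dB, regularized to avoid log(0)
    ild = 10.0 * np.log10((np.abs(X) ** 2 + 1e-12) / (np.abs(ref) ** 2 + 1e-12))
    return ipd, ild
```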
Jan-30-2020