Enhanced Factored Three-Way Restricted Boltzmann Machines for Speech Detection
Speech detection (SD) greatly improves the separation of speech sources from background interference [1]. SD techniques now attract intense attention across the general speech processing framework, including automatic speech recognition (ASR) [2], speech enhancement [3], and speech coding [1]. Recently, deep neural network (DNN) based 1D SD algorithms have shown great advantages over conventional voice activity detectors [4], [5]. The main benefits of such approaches lie in their easy integration into ASR, robust performance, and feature fusion capability. Zhang and Wu [4] introduced a deep belief network and used stacked Bernoulli-Bernoulli restricted Boltzmann machines (RBMs) to perform 1D SD. The idea of incorporating temporal context correlation to strengthen dynamic detection is widely used in network structure design [6], [7]. Other DNN-based 1D SD strategies focus either on improving the front-end acoustic feature inputs (e.g., acoustic models and statistical models) [8], [9] or on exploiting the supervised network structure in terms of sample training [10]. These DNN-based approaches rely on comprehensive network training and are then applied to assign binary labels to speech activity in the time domain. However, 1D SD methods integrate over frequency features and therefore cannot reveal information in the joint time-frequency domain, which is generally more expressive of speech activity than the binary values produced by 1D SD approaches.
Apr-20-2017