We present a number of Time-Delay Neural Network (TDNN) based architectures for multi-speaker phoneme recognition (/b,d,g/ task). We use speech from two female and four male speakers to compare the performance of the various architectures against a baseline recognition rate of 95.9% for a single TDNN on the six-speaker /b,d,g/ task.
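The defining operation of a TDNN layer is that each output frame is computed from a small window of input frames at fixed relative time delays. Below is a minimal sketch of one such layer in numpy; the layer sizes, context window, and tanh nonlinearity are illustrative placeholders, not the exact architecture evaluated in the abstract.

```python
import numpy as np

def tdnn_layer(frames, weights, bias, context=(-1, 0, 1)):
    """One time-delay layer: each output frame sees the input frames
    at the given relative offsets (the 'time delays')."""
    T, D = frames.shape
    outputs = []
    for t in range(-min(context), T - max(context)):
        # Splice the context window into one long feature vector.
        window = np.concatenate([frames[t + c] for c in context])
        outputs.append(np.tanh(weights @ window + bias))
    return np.array(outputs)

rng = np.random.default_rng(0)
x = rng.standard_normal((16, 13))           # 16 frames, 13 features each
W = rng.standard_normal((8, 13 * 3)) * 0.1  # 8 units, 3-frame context
b = np.zeros(8)
h = tdnn_layer(x, W, b)
print(h.shape)  # (14, 8): two frames are lost to the context window
```

Stacking such layers widens the effective temporal receptive field, which is what lets the network capture the formant transitions that distinguish /b/, /d/, and /g/.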
This paper surveys applied deep learning practices in the field of speaker recognition, covering both verification and identification. Speaker recognition has long been a widely studied topic in speech technology. Many research works have been carried out, yet little progress was achieved in the past 5-6 years. However, as deep learning techniques advance across most machine learning fields, the former state-of-the-art methods are being replaced by them in speaker recognition too. DL now appears to be the state-of-the-art solution for both speaker verification and identification. Standard x-vectors, in addition to i-vectors, are used as the baseline in most of the novel works. The increasing amount of gathered data opens up the territory to DL, where it is most effective.
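Whether the embedding is an i-vector or an x-vector, the simplest verification back end scores a trial by cosine similarity between the enrolled and test embeddings. A minimal sketch with random stand-in embeddings (the 512-dimensional vectors here are invented, not extracted from real audio):

```python
import numpy as np

def cosine_score(emb_a, emb_b):
    """Verification score in [-1, 1]; accept if above a tuned threshold."""
    return float(emb_a @ emb_b /
                 (np.linalg.norm(emb_a) * np.linalg.norm(emb_b)))

rng = np.random.default_rng(1)
enroll = rng.standard_normal(512)                  # enrolled speaker's embedding
same = enroll + 0.1 * rng.standard_normal(512)     # same speaker, new session
other = rng.standard_normal(512)                   # impostor

# A same-speaker trial should score higher than an impostor trial.
print(cosine_score(enroll, same) > cosine_score(enroll, other))  # True
```

In practice the surveyed systems replace this raw cosine back end with trained scoring such as PLDA, but the trial structure (enrollment embedding vs. test embedding) stays the same.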
The performance of automatic speaker recognition systems degrades when facing distorted speech data containing additive noise and/or reverberation. Statistical uncertainty propagation has been introduced as a promising paradigm to address this challenge. So far, different uncertainty propagation methods have been proposed to compensate for noise and reverberation in i-vectors in the context of speaker recognition. They have achieved promising results on small datasets such as YOHO and Wall Street Journal, but little or no improvement on the larger, highly variable NIST Speaker Recognition Evaluation (SRE) corpus. In this paper, we propose a complete uncertainty propagation method, whereby we model the effect of uncertainty both in the computation of unbiased Baum-Welch statistics and in the derivation of the posterior expectation of the i-vector. We conduct experiments on the NIST-SRE corpus mixed with real domestic noise and reverberation from the CHiME-2 corpus and preprocessed by multichannel speech enhancement. The proposed method improves the equal error rate (EER) by 4% relative compared to a conventional i-vector based speaker verification baseline. This contrasts with previous methods, which degrade performance.
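The reported metric, equal error rate (EER), is the operating point where the false accept and false reject rates coincide as the decision threshold is swept. A sketch of that computation on synthetic scores (the Gaussian score distributions below are invented for illustration):

```python
import numpy as np

def eer(target_scores, impostor_scores):
    """Sweep the threshold over all observed scores and return the error
    rate where false accepts and false rejects are (nearly) equal."""
    thresholds = np.sort(np.concatenate([target_scores, impostor_scores]))
    best_gap, best_eer = 1.0, None
    for thr in thresholds:
        frr = np.mean(target_scores < thr)     # targets wrongly rejected
        far = np.mean(impostor_scores >= thr)  # impostors wrongly accepted
        if abs(far - frr) < best_gap:
            best_gap, best_eer = abs(far - frr), (far + frr) / 2
    return best_eer

rng = np.random.default_rng(2)
targets = rng.normal(2.0, 1.0, 1000)    # genuine-trial scores
impostors = rng.normal(0.0, 1.0, 1000)  # impostor-trial scores
print(round(eer(targets, impostors), 3))
```

A "4% relative" improvement means the EER itself shrinks by 4% of its baseline value (e.g. 10.0% absolute would drop to 9.6%), not by 4 percentage points.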
ABSTRACT In this paper, we propose "personal VAD", a system to detect the voice activity of a target speaker at the frame level. This system is useful for gating the inputs to a streaming speech recognition system, such that it only triggers for the target user, which helps reduce the computational cost and battery consumption. We achieve this by training a VAD-like neural network that is conditioned on the target speaker embedding or the speaker verification score. With our optimal setup, we are able to train a 130KB model that outperforms a baseline system in which individually trained standard VAD and speaker recognition networks are combined to perform the same task. Index Terms: Personal VAD, voice activity detection, speaker recognition, speech recognition 1. INTRODUCTION In modern speech processing systems, voice activity detection (VAD) usually lives upstream of other speech components such as speech recognition and speaker recognition. As a gating module, VAD not only improves the performance of downstream components by discarding non-speech signal, but also significantly reduces the overall computational cost due to its relatively small size.
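The baseline described above, combining a standard VAD with a speaker recognition network, amounts to a per-frame gate: forward a frame only if it is speech and likely spoken by the target user. A minimal sketch of that gating logic; the per-frame speech posteriors and verification scores below are invented placeholders, not outputs of the paper's models.

```python
import numpy as np

def personal_vad_gate(speech_prob, spk_score, p_thr=0.5, s_thr=0.0):
    """Boolean mask of frames to forward to the recognizer: a frame
    passes only if it is speech AND attributed to the target speaker."""
    return (speech_prob > p_thr) & (spk_score > s_thr)

speech_prob = np.array([0.9, 0.8, 0.1, 0.95, 0.7])  # VAD posterior per frame
spk_score   = np.array([1.2, -0.5, 0.8, 0.4, 0.9])  # verification score per frame
mask = personal_vad_gate(speech_prob, spk_score)
print(mask.tolist())  # [True, False, False, True, True]
```

The proposed personal VAD replaces this two-model pipeline with a single small network that emits the combined decision directly, which is what makes the 130KB footprint possible.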