Personal VAD: Speaker-Conditioned Voice Activity Detection
Ding, Shaojin, Wang, Quan, Chang, Shuo-yiin, Wan, Li, Moreno, Ignacio Lopez
ABSTRACT In this paper, we propose "personal VAD", a system to detect the voice activity of a target speaker at the frame level. This system is useful for gating the inputs to a streaming speech recognition system, such that it only triggers for the target user, which helps reduce the computational cost and battery consumption. We achieve this by training a VAD-alike neural network that is conditioned on the target speaker embedding or the speaker verification score. With our optimal setup, we are able to train a 130KB model that outperforms a baseline system where individually trained standard VAD and speaker recognition networks are combined to perform the same task.
Index Terms: personal VAD, voice activity detection, speaker recognition, speech recognition
1. INTRODUCTION In modern speech processing systems, voice activity detection (VAD) usually sits upstream of other speech components such as speech recognition and speaker recognition. As a gating module, VAD not only improves the performance of downstream components by discarding non-speech signal, but also significantly reduces the overall computational cost due to its relatively small size.
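To make the gating idea concrete, the following is a minimal sketch of target-speaker gating in the style of the two-stage baseline the abstract mentions (a standard VAD score combined with a speaker similarity score), not the proposed single network. Everything here is an assumption for illustration: the function names `cosine_similarity` and `gate_frames`, the thresholds, the toy two-dimensional embeddings, and the simplification that a speaker embedding is available for every frame.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def gate_frames(speech_probs, frame_embeddings, target_embedding,
                speech_threshold=0.5, speaker_threshold=0.7):
    """Return indices of frames to forward to a downstream recognizer.

    Hypothetical two-stage gate: a frame passes only if it looks like speech
    AND its embedding is close to the enrolled target speaker's embedding.
    Both thresholds are illustrative values, not taken from the paper.
    """
    kept = []
    for i, (prob, emb) in enumerate(zip(speech_probs, frame_embeddings)):
        if prob < speech_threshold:
            continue  # non-speech: discard before any speaker comparison
        if cosine_similarity(emb, target_embedding) >= speaker_threshold:
            kept.append(i)  # speech from (likely) the target speaker
    return kept


# Toy data: frame 0 is non-speech, frame 2 is speech from a different speaker.
speech_probs = [0.1, 0.9, 0.8, 0.95]
frame_embeddings = [[1.0, 0.0], [1.0, 0.1], [0.0, 1.0], [0.9, 0.2]]
target = [1.0, 0.0]

kept = gate_frames(speech_probs, frame_embeddings, target)
print(kept)  # [1, 3]
```

Only frames 1 and 3 reach the recognizer in this toy run. The abstract's proposal replaces this pair of separate decisions with one small network that produces the frame-level decision directly, conditioned on the target speaker embedding or the verification score.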
Aug-12-2019