Personal VAD: Speaker-Conditioned Voice Activity Detection

Ding, Shaojin, Wang, Quan, Chang, Shuo-yiin, Wan, Li, Moreno, Ignacio Lopez

Aug-12-2019–arXiv.org Machine Learning

ABSTRACT

In this paper, we propose "personal VAD", a system to detect the voice activity of a target speaker at the frame level. This system is useful for gating the inputs to a streaming speech recognition system, such that it only triggers for the target user, which helps reduce the computational cost and battery consumption. We achieve this by training a VAD-alike neural network that is conditioned on the target speaker embedding or the speaker verification score. With our optimal setup, we are able to train a 130KB model that outperforms a baseline system where individually trained standard VAD and speaker recognition network are combined to perform the same task.

Index Terms: personal VAD, voice activity detection, speaker recognition, speech recognition

1. INTRODUCTION

In modern speech processing systems, voice activity detection (VAD) usually lives in the upstream of other speech components such as speech recognition and speaker recognition. As a gating module, VAD not only improves the performance of downstream components by discarding non-speech signal, but also significantly reduces the overall computational cost due to its relatively small size.
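The two ideas named in the abstract, conditioning a frame-level detector on a target speaker embedding and gating the recognizer's input on the detector's output, can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the function names `condition_frames` and `gate_frames`, the choice to condition by concatenating the embedding to each frame, and the 0.5 threshold are hypothetical and are not taken from the paper. The neural network that would map conditioned frames to per-frame target-speaker probabilities is not shown.

```python
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def condition_frames(
    frames: Sequence[Sequence[float]],
    speaker_embedding: Sequence[float],
) -> List[List[float]]:
    """Append the target-speaker embedding to every acoustic frame.

    Assumption: conditioning is done by simple concatenation, so each
    output frame has len(frame) + len(speaker_embedding) values.
    """
    return [list(frame) + list(speaker_embedding) for frame in frames]


def gate_frames(
    frames: Sequence[T],
    target_probs: Sequence[float],
    threshold: float = 0.5,  # hypothetical operating point
) -> List[T]:
    """Keep only frames whose target-speaker probability reaches the threshold.

    The returned frames are the ones a downstream streaming recognizer
    would receive; all other frames are dropped.
    """
    if len(frames) != len(target_probs):
        raise ValueError("frames and target_probs must have the same length")
    return [f for f, p in zip(frames, target_probs) if p >= threshold]


if __name__ == "__main__":
    # Two toy 2-dimensional frames and a toy 2-dimensional embedding.
    conditioned = condition_frames([[0.1, 0.2], [0.3, 0.4]], [1.0, 0.0])
    print(conditioned)

    # Placeholder probabilities standing in for a model's per-frame output.
    kept = gate_frames(["frame0", "frame1", "frame2"], [0.9, 0.2, 0.5])
    print(kept)
```

In this sketch the gate is a hard per-frame threshold; a deployed system would more likely smooth decisions over neighboring frames, which is omitted here for brevity.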



