Multi-task Learning for Speaker Verification and Voice Trigger Detection (Machine Learning)

Automatic speech transcription and speaker recognition are usually treated as separate tasks even though they are interdependent. In this study, we investigate training a single network to perform both tasks jointly. We train the network in a supervised multi-task learning setup, where the speech transcription branch of the network is trained to minimise a phonetic connectionist temporal classification (CTC) loss while the speaker recognition branch is trained to classify the input sequence with the correct speaker label. We present a large-scale empirical study where the model is trained using several thousand hours of labelled training data for each task. We evaluate the speech transcription branch on a voice trigger detection task and the speaker recognition branch on a speaker verification task. Results demonstrate that the network is able to encode both phonetic *and* speaker information in its learnt representations while yielding accuracies at least as good as the baseline models for each task, with the same number of parameters as the independent models.
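The setup described above pairs a shared encoder with two task-specific heads. A minimal PyTorch sketch of that architecture and joint loss follows; all layer sizes, the `alpha` mixing weight, and the class names are illustrative assumptions, not the paper's configuration.

```python
import torch
import torch.nn as nn

class MultiTaskSpeechNet(nn.Module):
    """Hypothetical sketch: a shared BiLSTM encoder feeding a phonetic
    CTC branch and an utterance-level speaker-classification branch."""
    def __init__(self, n_mels=40, hidden=128, n_phones=42, n_speakers=1000):
        super().__init__()
        self.encoder = nn.LSTM(n_mels, hidden, num_layers=2,
                               batch_first=True, bidirectional=True)
        self.phone_head = nn.Linear(2 * hidden, n_phones + 1)  # +1 for CTC blank
        self.speaker_head = nn.Linear(2 * hidden, n_speakers)

    def forward(self, x):
        h, _ = self.encoder(x)                         # (B, T, 2H) frame encodings
        phone_logits = self.phone_head(h)              # per-frame phone logits for CTC
        spk_logits = self.speaker_head(h.mean(dim=1))  # pooled utterance-level logits
        return phone_logits, spk_logits

def joint_loss(phone_logits, spk_logits, phones, phone_lens,
               input_lens, speakers, alpha=0.5):
    """Weighted sum of the phonetic CTC loss and the speaker
    cross-entropy loss (alpha is an assumed mixing weight)."""
    log_probs = phone_logits.log_softmax(-1).transpose(0, 1)  # (T, B, C) for CTCLoss
    ctc = nn.CTCLoss(blank=phone_logits.size(-1) - 1, zero_infinity=True)(
        log_probs, phones, input_lens, phone_lens)
    ce = nn.functional.cross_entropy(spk_logits, speakers)
    return alpha * ctc + (1 - alpha) * ce
```

Because both heads read the same encoder output, the parameter count matches a single-task model of the same encoder size, mirroring the abstract's claim.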

Apple details AI to help voice assistants recognize hotwords and multilingual speakers


Speech recognition is an area of acute interest for Apple, whose cross-platform Siri virtual assistant is used by over 500 million customers worldwide. This past week, the tech giant published a series of preprint research papers investigating techniques to improve voice trigger detection and speaker verification, as well as language identification for multiple speakers. In the first of the papers, a team of Apple researchers proposes an AI model trained to perform both automatic speech recognition and speaker recognition. As they explain in the abstract, the commands recognized by speech-based personal assistants are usually prefixed with a trigger phrase (e.g., "Hey, Siri"), and detecting this trigger phrase involves two steps. The AI first must decide whether the phonetic content in the input audio matches that of the trigger phrase (voice trigger detection), and then it must determine whether the speaker's voice matches the voice of a registered user or users (speaker verification).
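The two-step decision the article describes can be sketched as a simple gate: check the phonetic trigger score first, then verify the speaker. The function and both thresholds below are hypothetical illustrations, not Apple's implementation.

```python
def accept_utterance(phonetic_score, speaker_score,
                     trigger_threshold=0.8, speaker_threshold=0.7):
    """Toy two-stage gate: voice trigger detection, then speaker
    verification. Scores are assumed to lie in [0, 1]; thresholds
    are arbitrary placeholders."""
    if phonetic_score < trigger_threshold:
        return False  # audio does not sound like the trigger phrase
    # Only a registered user's voice should wake the assistant.
    return speaker_score >= speaker_threshold
```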

Voice trigger detection from LVCSR hypothesis lattices using bidirectional lattice recurrent neural networks (Machine Learning)

We propose a method to reduce false voice triggers of a speech-enabled personal assistant by post-processing the hypothesis lattice of a server-side large-vocabulary continuous speech recognizer (LVCSR) via a neural network. We first discuss how an estimate of the posterior probability of the trigger phrase can be obtained from the hypothesis lattice using known techniques to perform detection, then investigate a statistical model that processes the lattice in a more explicitly data-driven, discriminative manner. We propose using a Bidirectional Lattice Recurrent Neural Network (LatticeRNN) for the task, and show that it can significantly improve detection accuracy over using the 1-best result or the posterior.
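The baseline the abstract mentions, estimating the trigger phrase's posterior from recognizer hypotheses, can be illustrated on an n-best list as a lattice stand-in: softmax-normalise the hypothesis scores, then sum the probability mass of hypotheses containing the trigger phrase. This is a toy illustration of the general idea, not the paper's lattice algorithm.

```python
import math

def trigger_posterior(nbest, trigger="hey siri"):
    """Toy trigger-phrase posterior from an n-best list of
    (text, log_score) pairs: normalise scores with a softmax,
    then sum the mass of hypotheses containing the trigger."""
    m = max(score for _, score in nbest)  # subtract max for numerical stability
    weights = [(text, math.exp(score - m)) for text, score in nbest]
    total = sum(w for _, w in weights)
    return sum(w for text, w in weights if trigger in text) / total
```

The paper's LatticeRNN instead processes the full lattice discriminatively, which the abstract reports outperforms both the 1-best result and this kind of posterior estimate.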

Google Home Routines: How to put them to use


Unless you've dug deep into the settings menu for Google Home, you might not know about the smart speaker's most powerful feature. It's called Routines, and it allows you to execute multiple actions with a single voice command. For example, you can have Google Assistant announce the weather, a personalized traffic report, and news updates while you get ready for work, or have it dim your smart light bulbs and play some relaxing music a few minutes before bedtime. These routines even work with the Google Assistant app on iOS and Android--no smart speaker required. You can also schedule Routines to run at specific times without voice commands, effectively turning a Google Home speaker into a high-tech alarm clock that can wake you up with music, information, and smart home automations.

Bidirectional recurrent neural networks for seismic event detection (Artificial Intelligence)

Real-time, accurate passive seismic event detection is a critical safety measure across a range of monitoring applications, from reservoir stability to carbon storage to volcanic tremor detection. The most common detection procedure remains the Short-Term-Average to Long-Term-Average (STA/LTA) trigger, despite its common pitfalls of requiring a signal-to-noise ratio greater than one and being highly sensitive to the trigger parameters. Whilst numerous alternatives have been proposed, they are often tailored to a specific monitoring setting and therefore cannot be applied globally, or they are too computationally expensive to run in real time. This work introduces a deep learning approach to event detection as an alternative to the STA/LTA trigger. A bidirectional long short-term memory neural network is trained solely on synthetic traces. Evaluated on synthetic and field data, the neural network approach significantly outperforms the STA/LTA trigger, both in the number of correctly detected arrivals and in reducing the number of falsely detected events. Its real-time applicability is demonstrated with 600 traces processed in real time on a single processing unit.
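The STA/LTA baseline the abstract compares against is a classic characteristic function: the ratio of a short-term to a long-term moving average of signal energy, with an event declared wherever the ratio exceeds a threshold. A minimal NumPy sketch follows; the window lengths are illustrative sample counts, not recommended settings.

```python
import numpy as np

def sta_lta(trace, sta_len=10, lta_len=100):
    """Classic STA/LTA characteristic function: ratio of a
    short-term average to a long-term average of signal energy.
    Window lengths are illustrative (in samples, not seconds)."""
    energy = trace ** 2
    sta = np.convolve(energy, np.ones(sta_len) / sta_len, mode="same")
    lta = np.convolve(energy, np.ones(lta_len) / lta_len, mode="same")
    # Guard against division by zero on quiet segments.
    return sta / np.maximum(lta, 1e-12)
```

The sensitivity to `sta_len`, `lta_len`, and the detection threshold is exactly the parameter-tuning pitfall the abstract cites as motivation for the learned detector.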