acoustic processing

Google Assistant can now use your voice to verify purchases


Making purchases with your voice is convenient, but it's far from secure. Google is attempting to change that for Assistant by introducing an optional voice verification test. As The Verge reports, the new security feature relies on Google Assistant's Voice Match, and it's being rolled out slowly as part of a limited pilot program to test how well it works with smart speakers and smart displays. The Voice Match training feature was updated recently to include phrases so that Assistant could more accurately determine who is issuing commands. With better accuracy, Google clearly feels Voice Match is now good enough to act as an extra layer of security.

Multi-Scale Aggregation Using Feature Pyramid Module for Text-Independent Speaker Verification (Machine Learning)

Currently, the most widely used approach for speaker verification is deep speaker embedding learning. In this approach, convolutional neural networks are mainly used as a frame-level feature extractor, and speaker embeddings are extracted from the last layer of the feature extractor. Multi-scale aggregation (MSA), which utilizes multi-scale features from different layers of the feature extractor, has recently been introduced into the approach and has shown improved performance for both short and long utterances. This paper improves MSA by using a feature pyramid module, which enhances the speaker-discriminative information of features at multiple layers via a top-down pathway and lateral connections. We extract speaker embeddings using the enhanced features, which contain rich speaker information at different resolutions. Experiments on the VoxCeleb dataset show that the proposed module improves on previous MSA methods with a smaller number of parameters, providing better performance than state-of-the-art approaches.
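The top-down pathway with lateral connections described in this abstract can be illustrated with a small sketch. This is not the paper's implementation: the function names are hypothetical, the lateral connection is assumed to be an identity mapping (a real network would typically use a learned projection), upsampling is nearest-neighbor along time, and plain mean pooling stands in for whatever pooling the authors use.

```python
def upsample_nearest(frames, target_len):
    """Stretch a feature map along time to target_len frames (nearest neighbor)."""
    src_len = len(frames)
    return [frames[i * src_len // target_len] for i in range(target_len)]


def top_down_enhance(levels):
    """Enhance multi-scale feature maps with a top-down pathway.

    levels: feature maps ordered from fine (many frames) to coarse (few frames).
    Each map is a list of frames, each frame a list of channel values.
    Assumption: all levels already share the same channel count, so the
    lateral connection is a plain identity (hypothetical simplification).
    """
    enhanced = [None] * len(levels)
    enhanced[-1] = levels[-1]  # the coarsest map starts the pathway unchanged
    for i in range(len(levels) - 2, -1, -1):
        top = upsample_nearest(enhanced[i + 1], len(levels[i]))
        # lateral connection: add the finer map to the upsampled coarser one
        enhanced[i] = [
            [a + b for a, b in zip(lateral_frame, top_frame)]
            for lateral_frame, top_frame in zip(levels[i], top)
        ]
    return enhanced


def mean_pool(frames):
    """Average a feature map over time, giving one value per channel."""
    n = len(frames)
    return [sum(frame[c] for frame in frames) / n for c in range(len(frames[0]))]


def msa_embedding(levels):
    """Pool every enhanced level and concatenate the results into one embedding."""
    embedding = []
    for feature_map in top_down_enhance(levels):
        embedding.extend(mean_pool(feature_map))
    return embedding


# Two toy levels with one channel: 4 fine frames and 2 coarse frames.
fine = [[1.0], [2.0], [3.0], [4.0]]
coarse = [[10.0], [20.0]]
print(msa_embedding([fine, coarse]))
```

In this toy case the fine level receives the coarse level's information before pooling, which is the sense in which the lower layers are "enhanced" by the layers above them.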

A Comparison of Metric Learning Loss Functions for End-To-End Speaker Verification (Machine Learning)

Despite the growing popularity of metric learning approaches, very little work has attempted to perform a fair comparison of these techniques for speaker verification. We try to fill this gap and compare several metric learning loss functions in a systematic manner on the VoxCeleb dataset. The first family of loss functions is derived from the cross entropy loss (usually used for supervised classification) and includes the congenerous cosine loss, the additive angular margin loss, and the center loss. The second family of loss functions focuses on the similarity between training samples and includes the contrastive loss and the triplet loss. We show that the additive angular margin loss function outperforms all other loss functions in the study, while learning more robust representations. Based on a combination of SincNet trainable features and the x-vector architecture, the network used in this paper brings us a step closer to a really-end-to-end speaker verification system, when combined with the additive angular margin loss, while still being competitive with the x-vector baseline. In the spirit of reproducible research, we also release open source Python code for reproducing our results, and share pretrained PyTorch models on torch.hub that can be used either directly or after fine-tuning.
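For readers unfamiliar with the additive angular margin loss named in this abstract, the following is a minimal single-sample sketch of the general technique, not the authors' released code. The function name, the margin and scale values, and the pure-Python formulation are all assumptions chosen for illustration.

```python
import math


def _normalize(vector):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]


def aam_softmax_loss(embedding, class_weights, target, margin=0.2, scale=30.0):
    """Additive angular margin loss for one sample (illustrative sketch).

    embedding:     speaker embedding (list of floats)
    class_weights: one weight vector per training speaker
    target:        index of the true speaker
    margin, scale: hypothetical hyperparameter values, not taken from the paper
    """
    e = _normalize(embedding)
    logits = []
    for k, w in enumerate(class_weights):
        cos_theta = sum(a * b for a, b in zip(e, _normalize(w)))
        cos_theta = max(-1.0, min(1.0, cos_theta))  # guard against rounding
        if k == target:
            # Penalize the true class by adding the margin to its angle.
            # Note: this sketch does not handle theta + margin > pi, which
            # practical implementations treat as a special case.
            cos_theta = math.cos(math.acos(cos_theta) + margin)
        logits.append(scale * cos_theta)
    # Cross entropy over the scaled cosines, with a stable log-sum-exp.
    peak = max(logits)
    log_sum_exp = peak + math.log(sum(math.exp(z - peak) for z in logits))
    return log_sum_exp - logits[target]


weights = [[1.0, 0.0], [0.0, 1.0]]
print(aam_softmax_loss([1.0, 1.0], weights, target=0, margin=0.0))
print(aam_softmax_loss([1.0, 1.0], weights, target=0, margin=0.2))
```

With the margin set to zero the function reduces to a cross entropy over scaled cosine similarities. A positive margin raises the loss for the same input, which pushes embeddings closer in angle to their own speaker's weight vector during training.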

Speaker Identification using EEG (Machine Learning)

In this paper we explore speaker identification using electroencephalography (EEG) signals. The performance of speaker identification systems degrades in the presence of background noise. This paper demonstrates that EEG features can be used to enhance the performance of speaker identification systems operating both with and without background noise. The paper further demonstrates that in the presence of high background noise, a speaker identification system using only EEG features as input performs better than a system using only acoustic features as input.

A Speaker Verification Backend for Improved Calibration Performance across Varying Conditions (Machine Learning)

In a recent work, we presented a discriminative backend for speaker verification that achieved good out-of-the-box calibration performance on most tested conditions containing varying levels of mismatch to the training conditions. This backend mimics the standard PLDA-based backend process used in most current speaker verification systems, including the calibration stage. All parameters of the backend are jointly trained to optimize the binary cross-entropy for the speaker verification task. Calibration robustness is achieved by making the parameters of the calibration stage a function of vectors representing the conditions of the signal, which are extracted using a model trained to predict condition labels. In this work, we propose a simplified version of this backend where the vectors used to compute the calibration parameters are estimated within the backend, without the need for a condition prediction model. We show that this simplified method provides similar performance to the previously proposed method while being simpler to implement and having fewer requirements on the training data. Further, we provide an analysis of different aspects of the method, including the effect of initialization, the nature of the vectors used to compute the calibration parameters, and the effect that the random seed and the number of training epochs have on performance. We also compare the proposed method with the trial-based calibration (TBC) method that, to our knowledge, was the state of the art for achieving good calibration across varying conditions. We show that the proposed method outperforms TBC while also being several orders of magnitude faster to run, comparable to the standard PLDA baseline.
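The idea of calibration parameters that depend on the signal's conditions can be sketched as follows. This is an illustration under strong assumptions, not the method from the paper: the scale and offset are modeled here as simple affine functions of a single condition vector, and all names and parameter values are hypothetical. The binary cross-entropy treats the calibrated score as a logit, which assumes equal priors for target and non-target trials.

```python
import math


def calibrated_llr(score, condition, alpha_params, beta_params):
    """Map a raw score to a calibrated value using condition-dependent parameters.

    condition:    vector describing the signal conditions (hypothetical input)
    alpha_params: (bias, weights) giving the scale as bias + weights . condition
    beta_params:  (bias, weights) giving the offset in the same way
    The affine form is an assumption made for this sketch.
    """
    alpha_bias, alpha_weights = alpha_params
    beta_bias, beta_weights = beta_params
    alpha = alpha_bias + sum(w * c for w, c in zip(alpha_weights, condition))
    beta = beta_bias + sum(w * c for w, c in zip(beta_weights, condition))
    return alpha * score + beta


def binary_cross_entropy(llrs, labels):
    """Mean binary cross-entropy with calibrated scores used as logits.

    labels: 1 for same-speaker (target) trials, 0 for different-speaker trials.
    """
    total = 0.0
    for llr, label in zip(llrs, labels):
        z = -llr if label == 1 else llr
        # Numerically stable log(1 + exp(z)).
        total += max(z, 0.0) + math.log1p(math.exp(-abs(z)))
    return total / len(llrs)


# One trial with raw score 2.0 under a two-dimensional condition vector.
llr = calibrated_llr(2.0, [1.0, 0.0], (1.0, [0.5, 0.0]), (0.0, [-1.0, 0.0]))
print(llr, binary_cross_entropy([llr], [1]))
```

In a trained backend these parameters would be learned jointly by minimizing the cross-entropy over many trials; the sketch only shows how a condition vector can shift the scale and offset applied to a score.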

Facial and voice recognition in cars sounds like a privacy nightmare


I plopped into the front seat, expecting to laugh in the face of the machine attempting to measure my age, gender, emotional state, and comfort level all through infrared cameras and other sensors. But sitting expectantly in the car, equipped with French automotive software company Valeo's Smart Cocoon 4.0 system, I was flabbergasted when it pinpointed my exact age. Getting that number right made me trust the car's biometric system more than I probably should have, even as tools that measure your heartbeat, track your eyes, head position, voice, and more enter vehicles everywhere. At CES this year, driver and passenger monitoring kept popping up. It's a preview of what will become commonplace in the driver's seat in the coming years.

iFlytek develops voice recognition for law enforcement; moves forward with AI innovation despite U.S. ban


Chinese startup iFlytek says it has created AI technology for law enforcement that uses voice biometrics to identify a person, writes Nikkei Asian Review. In the coming years, iFlytek aims to use it to fight phone scams after rolling out the voiceprint recognition system across the country. "Because recordings are important evidence when it comes to phone scams, demand for voice recognition is growing," said Fu Zhonghua, the deputy head of iFlytek's research center. Fu added that the technology is intended for law enforcement and phone monitoring, where it would identify scammers' voiceprints and hang up, but that it can also be applied in finance. Government-owned China Construction Bank is already using voiceprints to verify customer identity alongside passwords.

China's iFlytek claims breakthrough in AI-powered voice recognition


Chinese artificial intelligence startup iFlytek says it has developed AI-powered technology that can accurately identify a person by his or her voice, for use in law enforcement. The company expects to be able to roll out a voiceprint recognition system nationwide in two to three years, said Fu Zhonghua, the deputy head of iFlytek's research center here. The Chinese market for such technology has the prospect of becoming a driver of earnings growth for iFlytek, which has been hit with U.S. sanctions for its alleged role in China's internationally criticized treatment of Muslim minorities. "Because recordings are important evidence when it comes to phone scams, demand for voice recognition is growing," Fu told reporters at the lab. The voiceprint recognition tool harnesses iFlytek's strength in using AI to analyze data.

Now you can get Google's real-time audio transcription app on older Pixels – here's how


One of the benefits of investing in a new flagship handset such as the Pixel 4 is that it gives you access to exclusive apps and features. Just as the likes of Samsung and OnePlus do, Google gives owners of its latest Pixel devices new toys to play with, and with the Pixel 4 this included the Recorder app. More than just a simple audio recording tool, Recorder also uses AI and voice recognition to automatically transcribe and label recordings in real time to make them far more useful. Now there's good news for anyone packing a Pixel 2, Pixel 3 or Pixel 3a: the Recorder app is no longer exclusive to the Pixel 4. The automatic transcription offered by the Recorder app is great for taking minutes of meetings, dictating documents and much more. The app is now being opened up to a wider range of users, giving many more people the chance to avoid the laborious task of transcribing recordings by hand.

Otter Starter Guide


Select the conversation in which you want to tag speakers. Otter will automatically tag speakers who have previously been identified. For new speakers, please teach Otter their voice by identifying them in the conversation. Select the unknown speaker icon to start identifying the speaker. Otter will list recent speakers for you to choose from.