

Speech Recognition


Deep Speaker Embeddings for Far-Field Speaker Recognition on Short Utterances

Gusev, Aleksei, Volokhov, Vladimir, Andzhukaev, Tseren, Novoselov, Sergey, Lavrentyeva, Galina, Volkova, Marina, Gazizullina, Alice, Shulipa, Andrey, Gorlanov, Artem, Avdeeva, Anastasia, Ivanov, Artem, Kozlov, Alexander, Pekhovsky, Timur, Matveev, Yuri

arXiv.org Machine Learning

Speaker recognition systems based on deep speaker embeddings have achieved significant performance in controlled conditions according to the results obtained for early NIST SRE (Speaker Recognition Evaluation) datasets. From a practical point of view, given the increased interest in virtual assistants (such as Amazon Alexa, Google Home, Apple Siri, etc.), speaker verification on short utterances in uncontrolled noisy environments is one of the most challenging and highly demanded tasks. This paper presents approaches aimed at achieving two goals: a) improving the quality of far-field speaker verification systems in the presence of environmental noise and reverberation, and b) reducing the system quality degradation for short utterances. For these purposes, we considered deep neural network architectures based on TDNN (Time-Delay Neural Network) and ResNet (Residual Neural Network) blocks. We experimented with state-of-the-art embedding extractors and their training procedures. The obtained results confirm that ResNet architectures outperform the standard x-vector approach in terms of speaker verification quality for both long-duration and short-duration utterances. We also investigate the impact of speech activity detection, different scoring models, and adaptation and score normalization techniques. The experimental results are presented for publicly available data and verification protocols for the VoxCeleb1, VoxCeleb2, and VOiCES datasets.
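The abstract above lists score normalization among the techniques studied. As an illustration only, here is a minimal sketch of symmetric score normalization (S-norm), a common calibration step in speaker verification; the cohort score arrays are hypothetical stand-ins, not the paper's actual setup:

```python
import numpy as np

def s_norm(raw_score, enroll_cohort_scores, test_cohort_scores):
    """Symmetric score normalization (S-norm).

    The raw enrol-vs-test score is z-normalized twice: once against the
    enrolment embedding's scores on a cohort of impostor utterances, and
    once against the test embedding's cohort scores; the two normalized
    scores are averaged.
    """
    zn = (raw_score - np.mean(enroll_cohort_scores)) / np.std(enroll_cohort_scores)
    tn = (raw_score - np.mean(test_cohort_scores)) / np.std(test_cohort_scores)
    return 0.5 * (zn + tn)
```

With an enrolment cohort scoring [0, 2] and a test cohort scoring [1, 3], a raw score of 3.0 normalizes to 0.5 * (2.0 + 1.0) = 1.5.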


x-vectors meet emotions: A study on dependencies between emotion and speaker recognition

Pappagari, Raghavendra, Wang, Tianzi, Villalba, Jesus, Chen, Nanxin, Dehak, Najim

arXiv.org Machine Learning

In this work, we explore the dependencies between speaker recognition and emotion recognition. We first show that knowledge learned for speaker recognition can be reused for emotion recognition through transfer learning. Then, we show the effect of emotion on speaker recognition. For emotion recognition, we show that using a simple linear model is enough to obtain good performance on the features extracted from pre-trained models such as the x-vector model. Then, we improve emotion recognition performance by fine-tuning for emotion classification. We evaluated our experiments on three different types of datasets: IEMOCAP, MSP-Podcast, and Crema-D. By fine-tuning, we obtained 30.40%, 7.99%, and 8.61% absolute improvement on IEMOCAP, MSP-Podcast, and Crema-D respectively over the baseline model with no pre-training. Finally, we present results on the effect of emotion on speaker verification. We observed that speaker verification performance is prone to changes in test speaker emotions. We found that trials with angry utterances performed worst in all three datasets. We hope our analysis will initiate a new line of research in the speaker recognition community.
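The idea that "a simple linear model is enough" on frozen embeddings can be sketched as below. The 512-dimensional Gaussian vectors are hypothetical stand-ins for x-vectors and emotion labels, not data from IEMOCAP or the other corpora:

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical stand-ins for 512-dim x-vectors: two emotion classes
# separated by a mean shift in the embedding space.
X = np.vstack([rng.normal(0.0, 1.0, (200, 512)),
               rng.normal(0.5, 1.0, (200, 512))])
y = np.array([-1.0] * 200 + [1.0] * 200)

# "Simple linear model" on frozen embeddings: least-squares weights;
# the sign of the projection gives the predicted class.
w, *_ = np.linalg.lstsq(X, y, rcond=None)
acc = float(np.mean(np.sign(X @ w) == np.sign(y)))
```

On data this cleanly separated the linear probe fits the training labels almost perfectly; the point is that no layers of the embedding extractor need to be retrained.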


Automatic Speech Transcription And Speaker Recognition Simultaneously Using Apple AI

#artificialintelligence

Last year, Apple witnessed several controversies regarding its speech recognition technology. To provide quality control for the company's voice assistant Siri, Apple asked its contractors to regularly listen to confidential voice recordings as part of the "Siri Grading Program". The company later apologised for this and published a statement announcing changes to the Siri grading program. This year, the tech giant has had a number of researchers working on speech recognition technology to upgrade its voice assistant. Recently, researchers at Apple developed an AI model which can perform automatic speech transcription and speaker recognition simultaneously.


VoxSRC 2019: The first VoxCeleb Speaker Recognition Challenge

Chung, Joon Son, Nagrani, Arsha, Coto, Ernesto, Xie, Weidi, McLaren, Mitchell, Reynolds, Douglas A, Zisserman, Andrew

arXiv.org Machine Learning

ABSTRACT The VoxCeleb Speaker Recognition Challenge 2019 aimed to assess how well current speaker recognition technology is able to identify speakers in unconstrained or 'in the wild' data. It consisted of: (i) a publicly available speaker recognition dataset from YouTube videos together with ground truth annotation and standardised evaluation software; and (ii) a public challenge and workshop held at Interspeech 2019 in Graz, Austria. This paper outlines the challenge and provides its baselines, results and discussions. Index Terms-- speaker verification, unconstrained conditions 1. INTRODUCTION The VoxCeleb Speaker Recognition Challenge (VoxSRC) 2019 was the first of a new series of speaker recognition challenges that are intended to be hosted annually. VoxSRC 2019 consisted of: (i) a publicly available speaker recognition dataset with speech segments 'in the wild', together with ground truth annotations and standardised evaluation software; and (ii) a public challenge and workshop held at Interspeech 2019 in Graz, Austria.


A Deep Neural Network for Short-Segment Speaker Recognition

Hajavi, Amirhossein, Etemad, Ali

arXiv.org Machine Learning

Today's interactive devices such as smart-phone assistants and smart speakers often deal with short-duration speech segments. As a result, speaker recognition systems integrated into such devices will be much better suited with models capable of performing the recognition task with short-duration utterances. In this paper, a new deep neural network, UtterIdNet, capable of performing speaker recognition with short speech segments is proposed. Our proposed model utilizes a novel architecture that makes it suitable for short-segment speaker recognition through an efficiently increased use of information in short speech segments. UtterIdNet has been trained and tested on the VoxCeleb datasets, the latest benchmarks in speaker recognition. Evaluations for different segment durations show consistent and stable performance for short segments, with significant improvement over the previous models for segments of 2 seconds, 1 second, and especially sub-second durations (250 ms and 500 ms).


An improved uncertainty propagation method for robust i-vector based speaker recognition

Ribas, Dayana, Vincent, Emmanuel

arXiv.org Artificial Intelligence

The performance of automatic speaker recognition systems degrades when facing distorted speech data containing additive noise and/or reverberation. Statistical uncertainty propagation has been introduced as a promising paradigm to address this challenge. So far, different uncertainty propagation methods have been proposed to compensate for noise and reverberation in i-vectors in the context of speaker recognition. They have achieved promising results on small datasets such as YOHO and Wall Street Journal, but little or no improvement on the larger, highly variable NIST Speaker Recognition Evaluation (SRE) corpus. In this paper, we propose a complete uncertainty propagation method, whereby we model the effect of uncertainty both in the computation of unbiased Baum-Welch statistics and in the derivation of the posterior expectation of the i-vector. We conduct experiments on the NIST-SRE corpus mixed with real domestic noise and reverberation from the CHiME-2 corpus and preprocessed by multichannel speech enhancement. The proposed method improves the equal error rate (EER) by 4% relative compared to a conventional i-vector based speaker verification baseline. This is to be compared with previous methods, which degrade performance.
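The EER reported above is the standard speaker verification metric: the operating point where the false-acceptance and false-rejection rates coincide. A minimal sketch of its computation, as a simple threshold sweep rather than the paper's actual evaluation code:

```python
import numpy as np

def equal_error_rate(target_scores, impostor_scores):
    """Equal error rate via a threshold sweep.

    For each candidate threshold, compute the false-acceptance rate
    (impostor scores at or above the threshold) and the false-rejection
    rate (target scores below it), and return the average of the two
    rates at the threshold where they are closest.
    """
    thresholds = np.sort(np.concatenate([target_scores, impostor_scores]))
    best_gap, best_eer = float("inf"), 1.0
    for t in thresholds:
        far = np.mean(impostor_scores >= t)  # impostors accepted
        frr = np.mean(target_scores < t)     # targets rejected
        if abs(far - frr) < best_gap:
            best_gap, best_eer = abs(far - frr), 0.5 * (far + frr)
    return float(best_eer)
```

For example, with target scores [2, 3, 4, 5] and impostor scores [0, 1, 2, 3], the curves cross at threshold 3, giving an EER of 0.25.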


Can We Use Speaker Recognition Technology to Attack Itself? Enhancing Mimicry Attacks Using Automatic Target Speaker Selection

Kinnunen, Tomi, Hautamäki, Rosa González, Vestman, Ville, Sahidullah, Md

arXiv.org Machine Learning

ABSTRACT We consider technology-assisted mimicry attacks in the context of automatic speaker verification (ASV). We use ASV itself to select targeted speakers to be attacked by human-based mimicry. We recorded 6 naive mimics for whom we select target celebrities from the VoxCeleb1 and VoxCeleb2 corpora (7,365 potential targets) using an i-vector system. The attacker attempts to mimic the selected target, with the utterances subjected to ASV tests using an independently developed x-vector system. Our main finding is negative: even if some of the attacker scores against the target speakers were slightly increased, our mimics did not succeed in spoofing the x-vector system. Interestingly, however, the relative ordering of the selected targets (closest, furthest, median) is consistent between the systems, which suggests some level of transferability between them.


Unified Hypersphere Embedding for Speaker Recognition

Hajibabaei, Mahdi, Dai, Dengxin

arXiv.org Artificial Intelligence

ABSTRACT Incremental improvements in the accuracy of Convolutional Neural Networks are usually achieved through the use of deeper and more complex models trained on larger datasets. However, enlarging datasets and models increases computation and storage costs and cannot be done indefinitely. In this work, we seek to improve the identification and verification accuracy of a text-independent speaker recognition system without using extra data or deeper and more complex models, by augmenting the training and testing data, finding the optimal dimensionality of the embedding space, and using more discriminative loss functions. Index Terms-- speaker recognition, speaker verification, augmentation, discriminative loss function, convolutional neural networks 1. INTRODUCTION Speaker recognition is an area of research with more than 50 years of history and applications ranging from forensics and security to human-computer interaction in consumer electronics. Speaker recognition can be categorized into the two tasks of text-dependent and text-independent speaker recognition with regard to the similarity of the uttered content between utterances.


On deep speaker embeddings for text-independent speaker recognition

Novoselov, Sergey, Shulipa, Andrey, Kremnev, Ivan, Kozlov, Alexandr, Shchemelinin, Vadim

arXiv.org Machine Learning

We investigate deep neural network performance in the text-independent speaker recognition task. We demonstrate that using angular softmax activation at the last classification layer of a classification neural network, instead of a simple softmax activation, allows training a more generalized discriminative speaker embedding extractor. Cosine similarity is an effective metric for speaker verification in this embedding space. We also address the problem of choosing an architecture for the extractor. We found that deep networks with residual frame-level connections outperform wide but relatively shallow architectures. This paper also proposes several improvements for previous DNN-based extractor systems to increase speaker recognition accuracy. We show that the discriminatively trained similarity metric learning approach outperforms the standard LDA-PLDA method as an embedding backend. The results obtained on the Speakers in the Wild and NIST SRE 2016 evaluation sets demonstrate the robustness of the proposed systems when dealing with close-to-real-life conditions.
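Scoring with cosine similarity in the embedding space, as the abstract describes, can be sketched as follows; the threshold value is an arbitrary illustration, not one tuned in the paper:

```python
import numpy as np

def cosine_score(emb_enroll, emb_test):
    """Cosine similarity between two speaker embeddings.

    With an angular-softmax-trained extractor, this single score is the
    verification metric: 1.0 for identical directions, 0.0 for
    orthogonal embeddings.
    """
    a = emb_enroll / np.linalg.norm(emb_enroll)
    b = emb_test / np.linalg.norm(emb_test)
    return float(a @ b)

def verify(emb_enroll, emb_test, threshold=0.5):
    # Accept the trial when the score clears a threshold that would be
    # tuned on a development set; 0.5 here is purely illustrative.
    return cosine_score(emb_enroll, emb_test) >= threshold
```

Note that the score depends only on the angle between the embeddings, which is why angular-margin training objectives pair naturally with this backend.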


Automatic Speaker Recognition using Transfer Learning

#artificialintelligence

Even with today's frequent technological breakthroughs in speech-interactive devices (think Siri and Alexa), few companies have tried their hand at enabling multi-user profiles. Google Home has been the most ambitious in this area, allowing up to six user profiles. The recent boom of this technology is what made the potential for this project very exciting to our team. We also wanted to engage in a project that is still a hot topic in deep-learning research, create interesting tools, learn more about neural network architectures, and make original contributions where possible. We sought to create a system able to quickly add user profiles and accurately identify their voices with very little training data, a few sentences at most! This learning from one to only a few samples is known as One-Shot Learning.