Speech Recognition
The NIST CTS Speaker Recognition Challenge
Sadjadi, Seyed Omid, Greenberg, Craig, Singer, Elliot, Mason, Lisa, Reynolds, Douglas
The US National Institute of Standards and Technology (NIST) has been conducting a second iteration of the CTS Challenge since August 2020. The current iteration of the CTS Challenge is a leaderboard-style speaker recognition evaluation using telephony data extracted from the unexposed portions of the Call My Net 2 (CMN2) and Multi-Language Speech (MLS) corpora collected by the LDC. The CTS Challenge is currently organized in a similar manner to the SRE19 CTS Challenge, offering only an open training condition using two evaluation subsets, namely Progress and Test. Unlike in the SRE19 Challenge, no training or development set was initially released, and NIST has publicly released the leaderboards on both subsets for the CTS Challenge. Which subset (i.e., Progress or Test) a trial belongs to is unknown to challenge participants, and each system submission needs to contain outputs for all of the trials. The CTS Challenge has also served, and will continue to serve, as a prerequisite for entrance to the regular SREs (such as SRE21). Since August 2020, a total of 53 organizations (forming 33 teams) from academia and industry have participated in the CTS Challenge and submitted more than 4400 valid system outputs. This paper presents an overview of the evaluation and several analyses of system performance for some primary conditions in the CTS Challenge. The CTS Challenge results thus far indicate remarkable improvements in performance due to 1) speaker embeddings extracted using large-scale and complex neural network architectures such as ResNets trained with angular margin losses, 2) extensive data augmentation, 3) the use of large amounts of in-house proprietary data from a large number of labeled speakers, and 4) long-duration fine-tuning.
- North America > United States (1.00)
- Asia > Japan > Honshū > Kantō > Tokyo Metropolis Prefecture > Tokyo (0.14)
- Europe > Slovenia (0.04)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (0.93)
- Information Technology > Artificial Intelligence > Machine Learning > Pattern Recognition > Speech Recognition (0.65)
Multi-View Self-Attention Based Transformer for Speaker Recognition
Wang, Rui, Ao, Junyi, Zhou, Long, Liu, Shujie, Wei, Zhihua, Ko, Tom, Li, Qing, Zhang, Yu
Initially developed for natural language processing (NLP), the Transformer model is now widely used for speech processing tasks such as speaker recognition, due to its powerful sequence modeling capabilities. However, conventional self-attention mechanisms were originally designed for modeling textual sequences without considering the characteristics of speech and speaker modeling. Besides, different Transformer variants for speaker recognition have not been well studied. In this work, we propose a novel multi-view self-attention mechanism and present an empirical study of different Transformer variants with or without the proposed attention mechanism for speaker recognition. Specifically, to balance the capabilities of capturing global dependencies and modeling the locality, we propose a multi-view self-attention mechanism for speaker Transformer, in which different attention heads can attend to different ranges of the receptive field. Furthermore, we introduce and compare five Transformer variants with different network architectures, embedding locations, and pooling methods to learn speaker embeddings. Experimental results on the VoxCeleb1 and VoxCeleb2 datasets show that the proposed multi-view self-attention mechanism improves speaker recognition performance, and the proposed speaker Transformer network attains excellent results compared with state-of-the-art models.
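The idea of letting different attention heads see different ranges can be illustrated with per-head attention masks. The sketch below is a minimal illustration, not the authors' implementation: the function names, the band-shaped masks, and the window sizes are all assumptions made for the example. Each head gets a mask in which position i may attend to position j only if they are within that head's window; a window of `None` stands for unrestricted (global) attention.

```python
def band_mask(seq_len, window):
    """Boolean mask: entry [i][j] is True if frame i may attend to frame j.

    window is the maximum allowed |i - j|; None means global attention.
    (Hypothetical helper for illustration only.)
    """
    if window is None:
        return [[True] * seq_len for _ in range(seq_len)]
    return [[abs(i - j) <= window for j in range(seq_len)]
            for i in range(seq_len)]


def multi_view_masks(seq_len, windows):
    """One mask per attention head, each with its own receptive-field range."""
    return [band_mask(seq_len, w) for w in windows]


# Three heads: a very local head, a wider local head, and a global head.
masks = multi_view_masks(6, [1, 2, None])
for head, mask in enumerate(masks):
    visible = sum(mask[0])  # how many frames the first frame can see
    print(f"head {head}: frame 0 sees {visible} frame(s)")
```

In a real model such masks would be applied to the attention logits before the softmax, typically by setting masked positions to a large negative value.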
- Information Technology > Artificial Intelligence > Natural Language (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Pattern Recognition > Speech Recognition (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.95)
Curricular SincNet: Towards Robust Deep Speaker Recognition by Emphasizing Hard Samples in Latent Space
Chowdhury, Labib, Kamal, Mustafa, Hasan, Najia, Mohammed, Nabeel
Deep learning models have become an increasingly preferred option for biometric recognition systems, such as speaker recognition. SincNet, a deep neural network architecture, gained popularity in speaker recognition tasks due to its parameterized sinc functions that allow it to work directly on the speech signal. The original SincNet architecture uses the softmax loss, which may not be the most suitable choice for recognition-based tasks. Such loss functions neither impose inter-class margins nor differentiate between easy and hard training samples. Curriculum learning, particularly approaches leveraging angular margin-based losses, has proven very successful in other biometric applications such as face recognition. The advantage of such curriculum learning-based techniques is that they impose inter-class margins while also taking into account easy and hard samples. In this paper, we propose Curricular SincNet (CL-SincNet), an improved SincNet model where we use a curricular loss function to train the SincNet architecture. The proposed model is evaluated on multiple datasets using intra-dataset and inter-dataset evaluation protocols. In both settings, the model performs competitively with other previously published work. In the case of inter-dataset testing, it achieves the best overall results, with a 4% reduction in error rate compared to SincNet and other published work.
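To make the notion of an angular margin concrete, here is a minimal sketch of an additive angular margin softmax loss for a single sample. It is not the curricular loss used by CL-SincNet, which additionally reweights hard samples during training; the function name, the margin and scale defaults, and the clamping of the angle at pi are assumptions made for this illustration.

```python
import math


def aam_softmax_loss(cosines, target, margin=0.2, scale=30.0):
    """Cross-entropy over scaled cosine logits with an angular margin.

    cosines: cosine similarity between the embedding and each class weight.
    target:  index of the true class.
    The margin is added to the angle of the target class only, which makes
    the target harder to satisfy and pushes classes apart.
    """
    logits = []
    for k, c in enumerate(cosines):
        c = max(-1.0, min(1.0, c))          # guard acos against rounding
        if k == target:
            theta = math.acos(c)
            # Clamp at pi so the penalised cosine stays monotonic (assumption).
            c = math.cos(min(theta + margin, math.pi))
        logits.append(scale * c)
    # Numerically stable log-sum-exp.
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum - logits[target]


cosines = [0.8, 0.1, -0.2]
plain = aam_softmax_loss(cosines, target=0, margin=0.0)
with_margin = aam_softmax_loss(cosines, target=0, margin=0.2)
print(f"loss without margin: {plain:.6f}, with margin: {with_margin:.6f}")
```

With the margin set to zero the function reduces to an ordinary softmax cross-entropy over scaled cosines, which is why the margin version always reports a loss at least as large for the same input.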
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.89)
- Information Technology > Artificial Intelligence > Machine Learning > Pattern Recognition > Speech Recognition (0.83)
NIST SRE CTS Superset: A large-scale dataset for telephony speaker recognition
This document provides a brief description of the National Institute of Standards and Technology (NIST) speaker recognition evaluation (SRE) conversational telephone speech (CTS) Superset. The CTS Superset has been created in an attempt to provide the research community with a large-scale dataset along with uniform metadata that can be used to effectively train and develop telephony (narrowband) speaker recognition systems. It contains a large number of telephony speech segments from more than 6800 speakers with speech durations distributed uniformly in the [10s, 60s] range. The segments have been extracted from the source corpora used to compile prior SRE datasets (SRE1996-2012), including the Greybeard corpus as well as the Switchboard and Mixer series collected by the Linguistic Data Consortium (LDC). In addition to the brief description, we also report speaker recognition results on the NIST 2020 CTS Speaker Recognition Challenge, obtained using a system trained with the CTS Superset. The results will serve as a reference baseline for the challenge.
- North America > United States (0.35)
- Europe > Belgium > Flanders > Antwerp Province > Antwerp (0.04)
Improving Fairness in Speaker Recognition
Fenu, Gianni, Medda, Giacomo, Marras, Mirko, Meloni, Giacomo
The human voice conveys unique characteristics of an individual, making voice biometrics a key technology for verifying identities in various industries. Despite the impressive progress of speaker recognition systems in terms of accuracy, a number of ethical and legal concerns have been raised, specifically relating to the fairness of such systems. In this paper, we aim to explore the disparity in performance achieved by state-of-the-art deep speaker recognition systems, when different groups of individuals characterized by a common sensitive attribute (e.g., gender) are considered. In order to mitigate the unfairness we uncovered by means of an exploratory study, we investigate whether balancing the representation of the different groups of individuals in the training set can lead to a more equal treatment of these demographic groups. Experiments on two state-of-the-art neural architectures and a large-scale public dataset show that models trained with demographically-balanced training sets exhibit a fairer behavior on different groups, while still being accurate. Our study is expected to provide a solid basis for instilling beyond-accuracy objectives (e.g., fairness) in speaker recognition.
- Materials > Chemicals > Industrial Gases > Liquified Gas (0.37)
- Materials > Chemicals > Commodity Chemicals > Petrochemicals > LNG (0.37)
- Energy > Oil & Gas > Midstream (0.37)
- Law (0.34)
- Information Technology > Artificial Intelligence > Machine Learning > Pattern Recognition > Speech Recognition (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.67)
EfficientTDNN: Efficient Architecture Search for Speaker Recognition in the Wild
Wang, Rui, Wei, Zhihua, Ji, Shouling, Hong, Zhen
Speaker recognition refers to audio biometrics that utilizes acoustic characteristics to recognize speakers automatically. These systems have emerged as an essential means of verifying identity in various scenarios, such as smart homes, general business interactions, e-commerce applications, and forensics. However, the mismatch between training and real-world data causes a shift of speaker embedding space and severely degrades the recognition performance. Various complicated neural architectures have been presented to address speaker recognition in the wild, but they neglect the requirements of storage and computation. To address this issue, we propose a neural architecture search-based efficient time-delay neural network (EfficientTDNN) to improve inference efficiency while maintaining recognition accuracy. The proposed EfficientTDNN contains three phases. First, supernet design constructs a dynamic neural architecture that consists of sequential cells and enables network pruning. Second, progressive training optimizes randomly sampled subnets that inherit the weights of the supernet. Third, three search methods, including manual grid search, random search, and model predictive evolutionary search, are introduced to find a trade-off between accuracy and efficiency. Results of experiments on the VoxCeleb dataset show EfficientTDNN provides a huge search space including approximately $10^{13}$ subnets and achieves 1.66% EER and 0.156 DCF$_{0.01}$ with 565M MACs. Comprehensive investigation suggests that the trained supernet generalizes to cells unseen during training and obtains an acceptable balance between accuracy and efficiency.
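For readers unfamiliar with the two metrics quoted above, the following sketch computes an equal error rate (EER) and a normalized minimum detection cost (minDCF) from a list of trial scores. It is a simple threshold sweep under stated assumptions, not the scoring tool used in the paper: the function names are hypothetical, the EER is approximated at the operating point where miss and false-alarm rates are closest, and the cost parameters `c_miss = c_fa = 1` are an assumed convention; only the target prior of 0.01 is taken from the abstract's DCF$_{0.01}$ notation.

```python
def error_rates(scores, labels):
    """(P_miss, P_fa) at every candidate threshold.

    labels: 1 for target (same-speaker) trials, 0 for non-target trials.
    A trial is accepted when its score is >= the threshold.
    """
    n_tgt = sum(labels)
    n_non = len(labels) - n_tgt
    points = []
    for t in sorted(set(scores)) + [float("inf")]:
        miss = sum(1 for s, y in zip(scores, labels) if y == 1 and s < t)
        fa = sum(1 for s, y in zip(scores, labels) if y == 0 and s >= t)
        points.append((miss / n_tgt, fa / n_non))
    return points


def eer(scores, labels):
    """Approximate EER: average of P_miss and P_fa where they are closest."""
    p_miss, p_fa = min(error_rates(scores, labels),
                       key=lambda p: abs(p[0] - p[1]))
    return (p_miss + p_fa) / 2


def min_dcf(scores, labels, p_target=0.01, c_miss=1.0, c_fa=1.0):
    """Minimum detection cost, normalized by the cost of a trivial system."""
    best = min(c_miss * pm * p_target + c_fa * pf * (1 - p_target)
               for pm, pf in error_rates(scores, labels))
    return best / min(c_miss * p_target, c_fa * (1 - p_target))


scores = [0.9, 0.8, 0.4, 0.7, 0.3, 0.1]
labels = [1, 1, 1, 0, 0, 0]
print(f"EER = {eer(scores, labels):.4f}, minDCF = {min_dcf(scores, labels):.4f}")
```

The sweep is quadratic in the number of trials, which is fine for a toy example; production scoring tools sort the scores once and accumulate the counts.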
- Information Technology > Artificial Intelligence > Representation & Reasoning > Search (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Pattern Recognition > Speech Recognition (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)
Remarks on Optimal Scores for Speaker Recognition
In this article, we first establish the theory of optimal scores for speaker recognition. Our analysis shows that the minimum Bayes risk (MBR) decisions for both the speaker identification and speaker verification tasks can be based on a normalized likelihood (NL). When the underlying generative model is a linear Gaussian, the NL score is mathematically equivalent to the PLDA likelihood ratio, and the empirical scores based on cosine distance and Euclidean distance can be seen as approximations of this linear Gaussian NL score under some conditions. We discuss a number of properties of the NL score and perform a simple simulation experiment to demonstrate the properties of the NL score.
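One elementary relationship between the two empirical scores mentioned above can be checked numerically: for length-normalized embeddings, the squared Euclidean distance equals 2 minus twice the cosine similarity, so the two scores rank trials identically. The sketch below verifies only this identity; it is not the article's derivation of the NL score, and the helper names are hypothetical.

```python
import math


def cosine_score(a, b):
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def length_normalize(v):
    """Scale a vector to unit Euclidean norm."""
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]


def squared_euclidean(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


enroll = [1.0, 2.0, 3.0]
test = [2.0, 0.0, 1.0]
cos = cosine_score(enroll, test)
d2 = squared_euclidean(length_normalize(enroll), length_normalize(test))
print(f"cosine = {cos:.6f}, squared distance = {d2:.6f}, 2 - 2*cos = {2 - 2 * cos:.6f}")
```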
- Information Technology > Artificial Intelligence > Natural Language (1.00)
- Information Technology > Artificial Intelligence > Speech > Acoustic Processing (0.70)
- Information Technology > Artificial Intelligence > Machine Learning > Pattern Recognition > Speech Recognition (0.62)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.46)
A Lightweight Speaker Recognition System Using Timbre Properties
Ohi, Abu Quwsar, Mridha, M. F., Hamid, Md. Abdul, Monowar, Muhammad Mostafa, Lee, Dongsu, Kim, Jinsul
Speaker recognition is an active research area with notable applications in biometric security and authentication systems. Currently, there exist many well-performing models in the speaker recognition domain. However, most of the advanced models implement deep learning that requires GPU support for real-time speech recognition, which makes them unsuitable for low-end devices. In this paper, we propose a lightweight text-independent speaker recognition model based on a random forest classifier. It also introduces new features that are used for both speaker verification and identification tasks. The proposed model uses human speech-based timbral properties as features that are classified using random forest. Timbre refers to the very basic properties of sound that allow listeners to discriminate among them. The prototype uses the seven most actively searched timbre properties (boominess, brightness, depth, hardness, roughness, sharpness, and warmth) as features of our speaker recognition model. The experiment is carried out on speaker verification and speaker identification tasks and shows the achievements and drawbacks of the proposed model. In the speaker identification phase, it achieves a maximum accuracy of 78%. In the speaker verification phase, the model maintains an accuracy of 80% with an equal error rate (EER) of 0.24.
- Asia > South Korea > Gwangju > Gwangju (0.04)
- Asia > Middle East > Saudi Arabia > Mecca Province > Jeddah (0.04)
- Asia > Bangladesh > Dhaka Division > Dhaka District > Dhaka (0.04)
- Information Technology > Artificial Intelligence > Speech > Speech Recognition (1.00)
- Information Technology > Artificial Intelligence > Speech > Acoustic Processing (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Performance Analysis > Accuracy (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Pattern Recognition > Speech Recognition (1.00)
A Machine of Few Words -- Interactive Speaker Recognition with Reinforcement Learning
Speaker recognition is a well-known and well-studied task in the speech processing domain. It has many applications, either for security or speaker adaptation of personal devices. In this paper, we present a new paradigm for automatic speaker recognition that we call Interactive Speaker Recognition (ISR). In this paradigm, the recognition system aims to incrementally build a representation of the speakers by requesting personalized utterances to be spoken, in contrast to the standard text-dependent or text-independent schemes. To do so, we cast the speaker recognition task into a sequential decision-making problem that we solve with Reinforcement Learning. Using a standard dataset, we show that our method achieves excellent performance while using small amounts of speech signal. This method could also be applied as an utterance selection mechanism for building speech synthesis systems.
- Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Pattern Recognition > Speech Recognition (1.00)
Meta-Learning for Short Utterance Speaker Recognition with Imbalance Length Pairs
Kye, Seong Min, Jung, Youngmoon, Lee, Hae Beom, Hwang, Sung Ju, Kim, Hoirin
In realistic settings, a speaker recognition system needs to identify a speaker given a short utterance, while the utterance used to enroll may be relatively long. However, existing speaker recognition models perform poorly with such short utterances. To solve this problem, we introduce a meta-learning scheme with imbalance length pairs. Specifically, we use a prototypical network and train it with a support set of long utterances and a query set of short utterances. However, since optimizing for only the classes in the given episode is not sufficient to learn discriminative embeddings for other classes in the entire dataset, we additionally classify both the support set and the query set against all classes in the training set to learn a well-discriminated embedding space. By combining these two learning schemes, our model outperforms existing state-of-the-art speaker verification models learned in a standard supervised learning framework on short utterances (1-2 seconds) on the VoxCeleb dataset. We also validate our proposed model for unseen speaker identification, on which it also achieves a significant gain over existing approaches.
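The episodic part of this scheme can be sketched as follows: speaker prototypes are averaged from embeddings of long (support) utterances, and each short (query) utterance is classified by its distance to those prototypes. This is a minimal illustration under stated assumptions, not the authors' code: embeddings are given as plain lists rather than produced by a network, squared Euclidean distance is assumed as the metric, all function names are hypothetical, and the additional classification against all training classes described in the abstract is omitted.

```python
import math


def mean_vector(vectors):
    """Element-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def sq_dist(a, b):
    """Squared Euclidean distance (assumed metric for this sketch)."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def prototype_posteriors(query, prototypes):
    """Softmax over negative distances from the query to each prototype."""
    logits = {spk: -sq_dist(query, p) for spk, p in prototypes.items()}
    m = max(logits.values())
    z = sum(math.exp(v - m) for v in logits.values())
    return {spk: math.exp(v - m) / z for spk, v in logits.items()}


def episode_loss(support, queries):
    """Mean negative log-probability of the true speaker over the query set.

    support: {speaker: [embeddings of long utterances]}
    queries: [(speaker, embedding of a short utterance)]
    """
    prototypes = {spk: mean_vector(embs) for spk, embs in support.items()}
    losses = [-math.log(prototype_posteriors(q, prototypes)[spk])
              for spk, q in queries]
    return sum(losses) / len(losses)


# Toy episode with two speakers and 2-dimensional embeddings.
support = {"A": [[1.0, 0.0], [0.8, 0.2]], "B": [[0.0, 1.0], [0.2, 0.8]]}
queries = [("A", [0.7, 0.1]), ("B", [0.1, 0.9])]
print(f"episode loss: {episode_loss(support, queries):.4f}")
```

In training, this loss would be backpropagated through the embedding network so that short-utterance embeddings land near the prototypes built from long utterances of the same speaker.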
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Pattern Recognition > Speech Recognition (0.84)
- Information Technology > Artificial Intelligence > Speech > Acoustic Processing (0.72)