Speech Recognition
Interpretable Spectrum Transformation Attacks to Speaker Recognition
Yao, Jiadi, Luo, Hong, Zhang, Xiao-Lei
The success of adversarial attacks on speaker recognition has mainly been demonstrated in white-box scenarios. When adversarial voices generated by attacking white-box surrogate models are applied to black-box victim models, i.e. in transfer-based black-box attacks, their transferability is not only far from satisfactory but also lacks an interpretable basis. To address these issues, we propose a general framework, named spectrum transformation attack based on the modified discrete cosine transform (STA-MDCT), to improve the transferability of adversarial voices to a black-box victim model. Specifically, we first apply the MDCT to the input voice. Then, we slightly modify the energy of different frequency bands to capture the salient regions of the adversarial noise in the time-frequency domain that are critical to a successful attack. Unlike existing approaches that operate on voices in the time domain, the proposed framework operates in the time-frequency domain, which improves the interpretability, transferability, and imperceptibility of the attack. Moreover, it can be implemented with any gradient-based attacker. To exploit the advantage of model ensembling, we implement STA-MDCT not only with a single white-box surrogate model but also with an ensemble of surrogate models. Finally, we visualize the saliency maps of adversarial voices with class activation maps (CAM), which offers an interpretable basis for transfer-based attacks in speaker recognition for the first time. Extensive comparisons with five representative attackers show that the CAM visualization clearly explains the effectiveness of STA-MDCT and the weaknesses of the comparison methods, and that the proposed method outperforms the comparison methods by a large margin.
- Asia > China > Zhejiang Province > Hangzhou (0.04)
- Europe > Russia (0.04)
- Asia > Russia (0.04)
- Information Technology > Security & Privacy (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Pattern Recognition > Speech Recognition (0.83)
- Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (0.68)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.68)
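The band-wise energy modification at the heart of STA-MDCT can be sketched in a few lines. The snippet below is a simplified stand-in rather than the paper's implementation: it uses a per-frame DCT in place of a lapped MDCT, and the frame length, number of bands, and scaling range are hypothetical choices.

```python
import numpy as np
from scipy.fft import dct, idct

def spectrum_transform(wave, frame_len=512, n_bands=8, eps=0.1, rng=None):
    """Randomly rescale the energy of frequency bands of each frame.

    Simplified stand-in for an MDCT-based spectrum transformation:
    a per-frame orthonormal DCT replaces the lapped MDCT, and the
    band layout and scaling range `eps` are illustrative.
    """
    rng = np.random.default_rng() if rng is None else rng
    out = wave.astype(float).copy()
    band = frame_len // n_bands
    for i in range(len(wave) // frame_len):
        frame = out[i * frame_len:(i + 1) * frame_len]
        spec = dct(frame, norm="ortho")
        scales = 1.0 + rng.uniform(-eps, eps, n_bands)  # per-band factors
        for b in range(n_bands):
            spec[b * band:(b + 1) * band] *= scales[b]
        out[i * frame_len:(i + 1) * frame_len] = idct(spec, norm="ortho")
    return out
```

In a transfer attack, a transformation like this would typically be applied to the input before each gradient step of any gradient-based attacker (e.g. FGSM or PGD), so that gradients are averaged over slightly perturbed spectra.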
Probabilistic Back-ends for Online Speaker Recognition and Clustering
Sholokhov, Alexey, Kuzmin, Nikita, Lee, Kong Aik, Chng, Eng Siong
This paper focuses on multi-enrollment speaker recognition, which naturally occurs in the task of online speaker clustering, and studies the properties of different scoring back-ends in this scenario. First, we show that the popular cosine scoring suffers from poor score calibration when the number of enrollment utterances varies. Second, we propose a simple replacement for cosine scoring based on an extremely constrained version of probabilistic linear discriminant analysis (PLDA). The proposed model improves over cosine scoring for multi-enrollment recognition while keeping the same performance in the case of one-to-one comparisons. Finally, we consider an online speaker clustering task in which each step naturally involves multi-enrollment recognition. We propose an online clustering algorithm that allows us to benefit from the properties of the PLDA model, such as the ability to handle uncertainty and better score calibration. Our experiments demonstrate the effectiveness of the proposed algorithm.
- Asia > Singapore (0.04)
- North America > United States > New York (0.04)
- Europe > Russia > Central Federal District > Moscow Oblast > Moscow (0.04)
- Information Technology > Artificial Intelligence > Speech > Speech Recognition (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.68)
- Information Technology > Artificial Intelligence > Machine Learning > Pattern Recognition > Speech Recognition (0.63)
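To make the multi-enrollment scoring concrete, below is a minimal sketch of an isotropic two-covariance PLDA scorer, the kind of "extremely constrained" PLDA the abstract suggests; the variance parameters b2 and w2 are hypothetical here and would be estimated from data in practice.

```python
import numpy as np

def plda_llr(enroll, test, b2=1.0, w2=1.0):
    """Verification LLR under x = y + e, y ~ N(0, b2*I), e ~ N(0, w2*I).

    enroll: (n, d) enrollment embeddings; test: (d,) test embedding.
    b2, w2: between/within-speaker variances (illustrative values).
    """
    n, d = enroll.shape
    xbar = enroll.mean(axis=0)
    # closed-form posterior of the speaker factor y given n enrollments
    post_mean = (n * b2 / (n * b2 + w2)) * xbar
    post_var = b2 * w2 / (n * b2 + w2)

    def log_gauss(x, mu, var):
        return -0.5 * (d * np.log(2 * np.pi * var) + np.sum((x - mu) ** 2) / var)

    # same-speaker: test ~ N(post_mean, (post_var + w2) I)
    # different-speaker: test ~ N(0, (b2 + w2) I)
    return log_gauss(test, post_mean, post_var + w2) - log_gauss(test, 0.0, b2 + w2)
```

Unlike cosine similarity against an averaged embedding, this log-likelihood ratio depends explicitly on the number of enrollment utterances, which is one route to the better multi-enrollment calibration the paper targets.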
The Newsbridge-Telecom SudParis VoxCeleb Speaker Recognition Challenge 2022 System Description
Tevissen, Yannis, Boudy, Jérôme, Petitpont, Frédéric
We describe the system used by our team for the VoxCeleb Speaker Recognition Challenge 2022 (VoxSRC 2022) in the speaker diarization track. Our solution was designed around a new combination of voice activity detection algorithms that exploits the strengths of several systems. We introduce a novel multi-stream approach with a decision protocol based on classifier entropy. We call this method multi-stream voice activity detection and use it with standard baseline diarization embeddings, clustering, and resegmentation. With this work, we demonstrate that, using a strong baseline and working only on voice activity detection, one can achieve close to state-of-the-art results.
- Information Technology > Artificial Intelligence > Speech > Speech Recognition (0.69)
- Information Technology > Artificial Intelligence > Machine Learning > Pattern Recognition > Speech Recognition (0.61)
- Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (0.47)
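One hypothetical reading of the "decision protocol based on classifier entropy" is to weight each VAD stream per frame by its confidence; the sketch below illustrates that idea and is not the authors' system.

```python
import numpy as np

def fuse_vad(posteriors, eps=1e-8):
    """Entropy-weighted fusion of per-frame speech posteriors.

    posteriors: (n_streams, n_frames) speech probabilities in [0, 1].
    Streams with lower binary entropy (more confident decisions)
    receive larger weights; an assumed protocol, not the paper's.
    """
    p = np.clip(posteriors, eps, 1 - eps)
    entropy = -(p * np.log2(p) + (1 - p) * np.log2(1 - p))
    weights = 1.0 - entropy                     # confidence per stream/frame
    weights /= weights.sum(axis=0, keepdims=True) + eps
    fused = (weights * p).sum(axis=0)
    return fused > 0.5                          # speech/non-speech per frame
```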
Introducing Model Inversion Attacks on Automatic Speaker Recognition
Pizzi, Karla, Boenisch, Franziska, Sahin, Ugur, Böttinger, Konstantin
Model inversion (MI) attacks allow an adversary to reconstruct average per-class representations of a machine learning (ML) model's training data. It has been shown that in scenarios where each class corresponds to a different individual, such as face classifiers, this represents a severe privacy risk. In this work, we explore a new application for MI: the extraction of speakers' voices from a speaker recognition system. We present an approach to (1) reconstruct audio samples from a trained ML model and (2) extract intermediate voice feature representations which provide valuable insights into the speakers' biometrics. To this end, we propose an extension of MI attacks which we call sliding model inversion. Our sliding MI extends standard MI by iteratively inverting overlapping chunks of the audio samples, thereby leveraging the sequential properties of audio data for enhanced inversion performance. We show that the inverted audio data can be used to generate spoofed audio samples to impersonate a speaker and to execute voice-protected commands for highly secured systems on that speaker's behalf. To the best of our knowledge, our work is the first to extend MI attacks to audio data, and our results highlight the security risks resulting from the extraction of biometric data in this setup.
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.96)
- Information Technology > Artificial Intelligence > Machine Learning > Pattern Recognition > Speech Recognition (0.65)
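A minimal sketch of the sliding idea, assuming a differentiable speaker classifier in PyTorch; the window size, hop, step count, and optimiser settings are illustrative, and the actual attack involves more machinery.

```python
import torch

def sliding_inversion(model, target_id, total_len=48000, win=16000,
                      hop=8000, steps=200, lr=0.01):
    """Invert overlapping waveform windows against a speaker classifier.

    Standard MI optimises a single input to maximise the target-class
    score; here windows are inverted left to right, so each window is
    seeded with the already-inverted overlap of its predecessor.
    """
    x = torch.zeros(total_len)
    for start in range(0, total_len - win + 1, hop):
        chunk = x[start:start + win].clone().requires_grad_(True)
        opt = torch.optim.Adam([chunk], lr=lr)
        for _ in range(steps):
            opt.zero_grad()
            loss = -model(chunk.unsqueeze(0))[0, target_id]  # raise target score
            loss.backward()
            opt.step()
        x[start:start + win] = chunk.detach()
    return x
```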
Multi-source Domain Adaptation for Text-independent Forensic Speaker Recognition
Wang, Zhenyu, Hansen, John H. L.
Adapting speaker recognition systems to new environments is a widely used technique to improve a well-performing model learned from large-scale data towards task-specific small-scale data scenarios. However, previous studies focus on single-domain adaptation, which neglects the more practical scenario where training data are collected from multiple acoustic domains, as needed in forensic settings. Audio analysis for forensic speaker recognition offers unique challenges in model training with multi-domain training data due to location/scenario uncertainty and diversity mismatch between reference and naturalistic field recordings. It is also difficult to directly employ small-scale domain-specific data to train complex neural network architectures due to domain mismatch and performance loss. Fine-tuning is a commonly used adaptation method that retrains the model with weights initialized from a well-trained model. Alternatively, in this study, three novel adaptation methods based on domain adversarial training, discrepancy minimization, and moment matching are proposed to further promote adaptation performance across multiple acoustic domains. A comprehensive set of experiments is conducted to demonstrate that: 1) diverse acoustic environments do impact speaker recognition performance, which could advance research in audio forensics; 2) domain adversarial training learns discriminative features that are also invariant to shifts between domains; 3) discrepancy-minimizing adaptation achieves effective performance simultaneously across multiple acoustic domains; and 4) moment-matching adaptation together with dynamic distribution alignment also significantly improves speaker recognition performance on each domain, especially for the noisy LENA-field domain, compared to all other systems.
- Europe > Denmark > North Jutland > Aalborg (0.04)
- Asia > Middle East > Jordan (0.04)
- Asia > China > Zhejiang Province > Hangzhou (0.04)
- Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Pattern Recognition > Speech Recognition (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)
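Of the three adaptation objectives, moment matching is the simplest to sketch: a common instantiation penalises the maximum mean discrepancy (MMD) between embedding batches from two domains, added to the usual speaker classification loss. The RBF kernel and bandwidth below are generic choices, not necessarily the paper's.

```python
import torch

def mmd_loss(feats_a, feats_b, sigma=1.0):
    """Squared MMD with an RBF kernel between two embedding batches.

    feats_a, feats_b: (n, d) embeddings from two acoustic domains.
    sigma: kernel bandwidth (an illustrative value).
    """
    def kernel(x, y):
        return torch.exp(-torch.cdist(x, y) ** 2 / (2 * sigma ** 2))
    return (kernel(feats_a, feats_a).mean()
            + kernel(feats_b, feats_b).mean()
            - 2 * kernel(feats_a, feats_b).mean())
```

Minimising this term pulls the two domains' embedding distributions together while the classification loss keeps speakers separable.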
Automatic speaker recognition technology outperforms human listeners in the courtroom
A key question in a number of court cases is whether a speaker on an audio recording is a particular known speaker, for example, whether a speaker on a recording of an intercepted telephone call is the defendant. In most English-speaking countries, expert testimony is only admissible in a court of law if it will potentially assist the judge or the jury to make a decision. If the judge's or the jury's speaker identification were as accurate as or more accurate than a forensic scientist's forensic voice comparison, then the forensic-voice-comparison testimony would not be admissible. In a research paper published in the journal Forensic Science International, a multidisciplinary international team of researchers has reported the first set of results from a comprehensive study that compares the accuracy of speaker identification by individual listeners (like judges or jury members) with the accuracy of a forensic-voice-comparison system based on state-of-the-art automatic-speaker-recognition technology, using recordings that reflect the conditions of an actual case. The questioned-speaker recording was of a telephone call with background office noise, and the known-speaker recording was of a police interview conducted in an echoey room with background ventilation-system noise.
- Oceania > Australia > New South Wales (0.08)
- South America > Chile (0.06)
- Europe > United Kingdom (0.06)
Universal speaker recognition encoders for different speech segments duration
Novoselov, Sergey, Volokhov, Vladimir, Lavrentyeva, Galina
Creating universal speaker encoders that are robust across different acoustic conditions and speech durations remains a major challenge. According to our observations, systems trained on short speech segments are optimal for short-phrase speaker verification, while systems trained on long segments are superior for long-segment verification. A system trained simultaneously on pooled short and long speech segments does not give optimal verification results and usually degrades on both short and long segments. This paper addresses the problem of creating universal speaker encoders for different speech segment durations. We describe our simple recipe for training a universal speaker encoder for any selected neural network architecture. According to our evaluation results for wav2vec-TDNN based systems on the NIST SRE and VoxCeleb1 benchmarks, the proposed universal encoder improves speaker verification when enrollment and test speech segment durations differ. A key feature of the proposed encoder is that it has the same inference time as the selected neural network architecture.
- Asia > Russia (0.05)
- Europe > Russia > Northwestern Federal District > Leningrad Oblast > Saint Petersburg (0.04)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Pattern Recognition > Speech Recognition (0.44)
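The abstract does not spell out the recipe itself, so the snippet below only illustrates the baseline it improves on: naive duration pooling, where each training example is cropped to a random duration. The sampling rate and duration bounds are illustrative.

```python
import numpy as np

def sample_crop(wave, sr=16000, min_s=2.0, max_s=20.0, rng=None):
    """Draw a training crop of random duration from a waveform so the
    encoder sees both short and long segments (generic duration
    pooling, not the paper's recipe)."""
    rng = np.random.default_rng() if rng is None else rng
    dur = min(int(rng.uniform(min_s, max_s) * sr), len(wave))
    start = rng.integers(0, len(wave) - dur + 1)
    return wave[start:start + dur]
```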
Large-scale learning of generalised representations for speaker recognition
Jung, Jee-weon, Heo, Hee-Soo, Lee, Bong-Jin, Lee, Jaesong, Shim, Hye-jin, Kwon, Youngki, Chung, Joon Son, Watanabe, Shinji
The objective of this work is to develop a speaker recognition model to be used in diverse scenarios. We hypothesise that two components should be adequately configured to build such a model. First, an adequate architecture is required. We explore several recent state-of-the-art models, including ECAPA-TDNN and MFA-Conformer, as well as other baselines. Second, a massive amount of data is required. We investigate several new training data configurations combining a few existing datasets. The most extensive configuration includes over 87k speakers and 10.22k hours of speech. Four evaluation protocols are adopted to measure how the trained model performs in diverse scenarios. Through experiments, we find that MFA-Conformer, the model with the least inductive bias, generalises best. We also show that training with the proposed large data configurations gives better performance. A boost in generalisation is observed, where the average performance on the four evaluation protocols improves by more than 20%. In addition, we demonstrate that these models' performance can improve even further when capacity is increased.
- Asia > South Korea (0.04)
- North America > United States > Pennsylvania > Allegheny County > Pittsburgh (0.04)
- Europe > Finland > North Karelia > Joensuu (0.04)
- Asia > India (0.04)
- Information Technology > Artificial Intelligence > Machine Learning > Pattern Recognition > Speech Recognition (0.64)
- Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (0.46)
The ReturnZero System for VoxCeleb Speaker Recognition Challenge 2022
In this paper, we describe the top-scoring submissions of team RTZR for the VoxCeleb Speaker Recognition Challenge 2022 (VoxSRC-22) closed-dataset speaker verification Track 1. The top-performing system is a fusion of 7 models spanning 3 different types of model architectures. We focus on training models to learn extra-temporal information; therefore, all models were trained on 4-6 second frames for each utterance. We also apply the large-margin fine-tuning strategy, which has shown good performance in previous challenges, to some of our fusion models. During evaluation, we apply scoring methods with adaptive symmetric normalization (AS-Norm) and matrix score average (MSA). Finally, we fuse all the trained models with logistic regression. The final submission achieves 0.165 DCF and 2.912% EER on the VoxSRC22 test set.
- Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (0.69)
- Information Technology > Artificial Intelligence > Machine Learning > Pattern Recognition > Speech Recognition (0.66)
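Of the scoring methods mentioned, adaptive symmetric normalization has a standard closed form; a sketch follows, with the cohort size top_k as a typical but illustrative value.

```python
import numpy as np

def as_norm(score, enroll_cohort, test_cohort, top_k=300):
    """AS-Norm: normalise a raw trial score with adaptive cohorts.

    enroll_cohort / test_cohort: raw scores of the enrollment / test
    embedding against a set of imposter utterances; only the top_k
    most similar imposters contribute to the statistics.
    """
    def stats(cohort):
        top = np.sort(cohort)[-top_k:]
        return top.mean(), top.std()
    mu_e, sd_e = stats(enroll_cohort)
    mu_t, sd_t = stats(test_cohort)
    return 0.5 * ((score - mu_e) / sd_e + (score - mu_t) / sd_t)
```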
The Royalflush System for VoxCeleb Speaker Recognition Challenge 2022
Tian, Jingguang, Hu, Xinhui, Xu, Xinkang
In this technical report, we describe the Royalflush submissions for the VoxCeleb Speaker Recognition Challenge 2022 (VoxSRC-22). Our submissions cover track 1, for supervised speaker verification, and track 3, for semi-supervised speaker verification. For track 1, we develop a powerful U-Net-based speaker embedding extractor with a symmetric architecture. The proposed system achieves 2.06% EER and 0.1293 MinDCF on the validation set. Compared with the state-of-the-art ECAPA-TDNN, it obtains relative improvements of 20.7% in EER and 22.7% in MinDCF. For track 3, we employ joint training with source-domain supervision and target-domain self-supervision to obtain a speaker embedding extractor. A subsequent clustering process then produces target-domain pseudo-speaker labels. We adapt the speaker embedding extractor in a supervised manner using all source- and target-domain data, so that it can fully leverage information from both domains. Moreover, the clustering and supervised domain adaptation steps can be repeated until performance converges on the validation set. Our final submission is a fusion of 10 models and achieves 7.75% EER and 0.3517 MinDCF on the validation set.
- Information Technology > Artificial Intelligence > Machine Learning > Pattern Recognition > Speech Recognition (0.62)
- Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.47)
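The clustering step that produces the target-domain pseudo-speaker labels is not specified in the abstract; one common choice is agglomerative clustering of length-normalised embeddings with a cosine-distance threshold, sketched below with an illustrative threshold.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering

def pseudo_labels(embeddings, distance_threshold=0.6):
    """Cluster L2-normalised embeddings into pseudo-speaker labels.

    embeddings: (n, d) target-domain speaker embeddings.
    distance_threshold: cosine-distance cut (a hypothetical value);
    the number of clusters is inferred rather than fixed.
    """
    emb = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    clusterer = AgglomerativeClustering(
        n_clusters=None, distance_threshold=distance_threshold,
        metric="cosine", linkage="average")
    return clusterer.fit_predict(emb)
```

Each adaptation round would retrain the extractor on source labels plus these pseudo-labels, then re-cluster with the improved embeddings.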