speaker recognition
Disentangling Voice and Content with Self-Supervision for Speaker Recognition
For speaker recognition, it is difficult to extract an accurate speaker representation from speech because speech is a mixture of speaker traits and content. This paper proposes a disentanglement framework that simultaneously models speaker traits and content variability in speech. It is realized with three Gaussian inference layers, each consisting of a learnable transition model that extracts a distinct speech component. Notably, a strengthened transition model is specifically designed to model complex speech dynamics. We also propose a self-supervision method to dynamically disentangle content without any labels other than speaker identities. The efficacy of the proposed framework is validated via experiments on the VoxCeleb and SITW datasets, with 9.56% and 8.24% average reductions in EER and minDCF, respectively. Since neither additional model training nor extra data is required, the framework is readily applicable in practice.
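The Gaussian inference layers are described only at a high level above; a minimal sketch of one such layer, assuming a Kalman-style filter with a learnable linear transition model and diagonal noise (all names, shapes, and the simplified variance propagation are illustrative, not the paper's exact design), might look like this:

```python
# Hypothetical sketch: one Gaussian inference layer with a learnable
# transition model, tracking a latent speech component as (mean, variance).
import torch
import torch.nn as nn

class GaussianInferenceLayer(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.transition = nn.Linear(dim, dim, bias=False)  # learnable transition model
        self.log_q = nn.Parameter(torch.zeros(dim))        # process noise (log-variance)
        self.log_r = nn.Parameter(torch.zeros(dim))        # observation noise (log-variance)

    def forward(self, obs):
        # obs: (batch, time, dim) frame-level features
        b, t, d = obs.shape
        mean = torch.zeros(b, d, device=obs.device)
        var = torch.ones(b, d, device=obs.device)
        means = []
        for step in range(t):
            # Predict: propagate the latent state through the transition model
            # (variance propagation simplified to a diagonal update).
            mean = self.transition(mean)
            var = var + self.log_q.exp()
            # Update: fuse the prediction with the current observation.
            gain = var / (var + self.log_r.exp())          # diagonal Kalman gain
            mean = mean + gain * (obs[:, step] - mean)
            var = (1.0 - gain) * var
            means.append(mean)
        return torch.stack(means, dim=1)                   # filtered speech component
```

Several such layers, each with its own transition model, could then run side by side so that each one captures a different component of the signal.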
VoiceBlock: Privacy through Real-Time Adversarial Attacks with Audio-to-Audio Models
As governments and corporations adopt deep learning systems to collect and analyze user-generated audio data, concerns about security and privacy naturally emerge in areas such as automatic speaker recognition. While audio adversarial examples offer one route to mislead or evade these invasive systems, they are typically crafted through time-intensive offline optimization, limiting their usefulness in streaming contexts. Inspired by architectures for audio-to-audio tasks such as denoising and speech enhancement, we propose a neural network model capable of adversarially modifying a user's audio stream in real time. Our model learns to apply a time-varying finite impulse response (FIR) filter to outgoing audio, allowing for effective and inconspicuous perturbations with a small fixed delay suitable for streaming tasks. We demonstrate that our model is highly effective at de-identifying user speech from speaker recognition systems and that it transfers to an unseen recognition system. We conduct a perceptual study and find that, controlling for effectiveness, our method produces perturbations significantly less perceptible than those of baseline anonymization methods. Finally, we provide an implementation of our model capable of running in real time on a single CPU thread.
- Asia > China > Hong Kong (0.04)
- North America > United States > Massachusetts (0.04)
- Asia > Singapore > Central Region > Singapore (0.04)
- Asia > China > Guangdong Province > Shenzhen (0.04)
- Information Technology > Artificial Intelligence > Natural Language (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)
- Information Technology > Artificial Intelligence > Speech > Speech Recognition (0.72)
- Information Technology > Artificial Intelligence > Machine Learning > Pattern Recognition > Speech Recognition (0.63)
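As a rough illustration of the core mechanism in the VoiceBlock abstract above, the sketch below has a small recurrent controller predict per-frame FIR taps and apply them to the outgoing audio. The layer sizes, frame length, and independent per-frame filtering (which ignores frame-boundary context) are simplifying assumptions, not the authors' architecture:

```python
# Illustrative sketch: a network predicts short FIR filter taps per frame
# and applies them to the outgoing audio stream.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TimeVaryingFIR(nn.Module):
    def __init__(self, n_taps=64, frame=512, hidden=128):
        super().__init__()
        self.n_taps, self.frame = n_taps, frame
        self.controller = nn.GRU(frame, hidden, batch_first=True)
        self.to_taps = nn.Linear(hidden, n_taps)

    def forward(self, audio):
        # audio: (batch, samples); split into non-overlapping frames
        b, n = audio.shape
        n = n - n % self.frame
        frames = audio[:, :n].reshape(b, -1, self.frame)
        h, _ = self.controller(frames)
        taps = self.to_taps(h)                     # (batch, n_frames, n_taps)
        # Convolve each frame with its own predicted FIR filter
        # (frames are filtered independently here for simplicity).
        padded = F.pad(frames, (self.n_taps - 1, 0))
        out = torch.zeros_like(frames)
        for i in range(self.n_taps):
            start = self.n_taps - 1 - i
            out += taps[:, :, i:i + 1] * padded[:, :, start:start + self.frame]
        return out.reshape(b, n)

model = TimeVaryingFIR()
print(model(torch.randn(2, 4096)).shape)           # torch.Size([2, 4096])
```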
From Dialect Gaps to Identity Maps: Tackling Variability in Speaker Verification
Abdullah, Abdulhady Abas, Badawi, Soran, Abdullah, Dana A., Hamad, Dana Rasul
This work investigates the complexity and difficulty of Kurdish speaker detection across the language's several dialects. Because of its substantial phonetic and lexical differences, Kurdish, with dialects including Kurmanji, Sorani, and Hawrami, poses particular challenges for speaker recognition systems. The work examines the main difficulties in building a robust speaker identification system capable of accurately identifying speakers across dialects, and proposes solutions such as advanced machine learning approaches, data augmentation tactics, and the construction of comprehensive dialect-specific corpora. The results show that strategies customized for each dialect, together with cross-dialect training, greatly enhance recognition performance.
- Asia > Middle East > Republic of Türkiye (0.05)
- Asia > Middle East > Syria (0.05)
- Asia > Middle East > Iraq > Erbil Governorate > Erbil (0.04)
- (3 more...)
- Government (0.68)
- Information Technology (0.67)
- Information Technology > Artificial Intelligence > Speech > Speech Recognition (1.00)
- Information Technology > Artificial Intelligence > Speech > Acoustic Processing (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
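The cross-dialect training idea from the abstract above can be sketched very simply: pool dialect-specific corpora so every batch mixes dialects. The corpora, file names, and speaker ids below are hypothetical placeholders:

```python
# Toy sketch of cross-dialect training data sampling across hypothetical
# Kurmanji, Sorani, and Hawrami corpora.
import random

corpora = {
    "kurmanji": [("utt_km_001.wav", "spk01"), ("utt_km_002.wav", "spk02")],
    "sorani":   [("utt_so_001.wav", "spk03"), ("utt_so_002.wav", "spk04")],
    "hawrami":  [("utt_ha_001.wav", "spk05"), ("utt_ha_002.wav", "spk06")],
}

def cross_dialect_batch(batch_size=4):
    """Sample utterances uniformly across dialects so no dialect dominates."""
    batch = []
    for _ in range(batch_size):
        dialect = random.choice(list(corpora))
        batch.append((dialect, *random.choice(corpora[dialect])))
    return batch

print(cross_dialect_batch())
```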
Cross-Modal Distillation For Widely Differing Modalities
Zhao, Cairong, Jin, Yufeng, Song, Zifan, Chen, Haonan, Miao, Duoqian, Hu, Guosheng
Deep learning has achieved great progress recently; however, it is not easy or efficient to further improve performance simply by increasing model size. Multi-modal learning can mitigate this challenge by introducing richer and more discriminative information as input. To address the problem of limited access to multi-modal data at deployment time, we conduct multi-modal learning by introducing a teacher model that transfers discriminative knowledge to a student model during training. This knowledge transfer via distillation is not trivial, however, because the large domain gap between widely differing modalities can easily lead to overfitting. In this work, we introduce a cross-modal distillation framework. Specifically, we find that a hard constrained loss, e.g., an L2 loss forcing the student's features to exactly match the teacher's, easily leads to overfitting. To address this, we propose two soft constrained knowledge distillation strategies, at the feature level and the classifier level respectively. In addition, we propose a quality-based adaptive weights module that weights input samples by quantified data quality, leading to robust model training. We conduct experiments on speaker recognition and image classification tasks, and the results show that our approach effectively achieves knowledge transfer between the commonly used yet widely differing modalities of image, text, and speech.
- Europe > United Kingdom (0.04)
- Asia > China > Zhejiang Province > Hangzhou (0.04)
- Asia > China > Shanghai > Shanghai (0.04)
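A minimal sketch of the soft-constrained losses described above, assuming a cosine-based feature loss, a temperature-softened KL classifier loss, and a per-sample quality weight in [0, 1]; these are plausible instantiations, not the paper's exact formulations:

```python
# Sketch of soft-constrained cross-modal distillation with quality weighting.
import torch
import torch.nn.functional as F

def soft_classifier_loss(student_logits, teacher_logits, tau=4.0):
    # Match softened class distributions instead of hard targets.
    p_t = F.softmax(teacher_logits / tau, dim=-1)
    log_p_s = F.log_softmax(student_logits / tau, dim=-1)
    return F.kl_div(log_p_s, p_t, reduction="batchmean") * tau * tau

def distill_loss(student_feat, teacher_feat, student_logits, teacher_logits, quality):
    # Feature level: align directions rather than forcing exact equality
    # (a hard L2 constraint overfits when modalities differ widely).
    per_sample = 1.0 - F.cosine_similarity(student_feat, teacher_feat, dim=-1)
    # quality: (batch,) in [0, 1]; low-quality samples contribute less.
    feat = (quality * per_sample).mean()
    cls = soft_classifier_loss(student_logits, teacher_logits)
    return feat + cls

# Usage with random stand-in tensors:
s_f, t_f = torch.randn(8, 256), torch.randn(8, 256)
s_l, t_l = torch.randn(8, 10), torch.randn(8, 10)
print(distill_loss(s_f, t_f, s_l, t_l, quality=torch.rand(8)))
```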
SVeritas: Benchmark for Robust Speaker Verification under Diverse Conditions
Baali, Massa, Bisht, Sarthak, Teixeira, Francisco, Shapovalenko, Kateryna, Singh, Rita, Raj, Bhiksha
Speaker verification (SV) models are increasingly integrated into security, personalization, and access control systems, yet their robustness to many real-world challenges remains inadequately benchmarked. These challenges include a variety of natural and maliciously created conditions that cause signal degradation or mismatch between enrollment and test data, impacting performance. Existing benchmarks evaluate only subsets of these conditions and miss others entirely. We introduce SVeritas, a comprehensive speaker verification benchmark suite that assesses SV systems under stressors such as recording duration, spontaneity, content, noise, microphone distance, reverberation, channel mismatch, audio bandwidth, codecs, speaker age, and susceptibility to spoofing and adversarial attacks. While several existing benchmarks each cover some of these issues, SVeritas is the first comprehensive evaluation that not only includes all of them but also adds several new yet important real-life conditions that have not previously been benchmarked. We use SVeritas to evaluate several state-of-the-art SV models and observe that while some architectures maintain stability under common distortions, they suffer substantial performance degradation in scenarios involving cross-language trials, age mismatches, and codec-induced compression. Extending our analysis across demographic subgroups, we further identify disparities in robustness across age groups, gender, and linguistic backgrounds. By standardizing evaluation under realistic and synthetic stress conditions, SVeritas enables precise diagnosis of model weaknesses and establishes a foundation for advancing equitable and reliable speaker verification systems.
- North America > United States > Pennsylvania > Allegheny County > Pittsburgh (0.04)
- Europe > Portugal > Lisbon > Lisbon (0.04)
- Asia (0.04)
- Research Report > Experimental Study (0.69)
- Research Report > New Finding (0.46)
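An illustrative sketch of the condition-wise evaluation pattern such a benchmark implies: apply a stressor to the test audio, then score EER per condition. The stressor functions below (additive noise, crude bandwidth reduction) are simplified stand-ins, not SVeritas's actual perturbations:

```python
# Sketch of per-condition robustness evaluation for speaker verification.
import numpy as np

def add_noise(x, snr_db=10.0):
    noise = np.random.randn(*x.shape)
    scale = np.linalg.norm(x) / (np.linalg.norm(noise) * 10 ** (snr_db / 20))
    return x + scale * noise

def band_limit(x, keep=0.5):
    spec = np.fft.rfft(x)
    spec[int(len(spec) * keep):] = 0           # zero out upper frequencies
    return np.fft.irfft(spec, n=len(x))

def eer(scores, labels):
    """Rough EER: sweep thresholds over descending scores (1 = target trial)."""
    order = np.argsort(scores)[::-1]
    labels = np.asarray(labels)[order]
    far = np.cumsum(1 - labels) / max(int((labels == 0).sum()), 1)
    frr = 1 - np.cumsum(labels) / max(int((labels == 1).sum()), 1)
    idx = np.argmin(np.abs(far - frr))
    return (far[idx] + frr[idx]) / 2

conditions = {"noise_10dB": add_noise, "half_band": band_limit}
# For each condition: perturb test utterances, rescore trials, report eer().
```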
Multi-stream Convolutional Neural Network with Frequency Selection for Robust Speaker Verification
Yao, Wei, Chen, Shen, Cui, Jiamin, Lou, Yaolin
Speaker verification aims to verify whether an input speech utterance corresponds to the claimed speaker. Conventionally, such systems are deployed in a single-stream scenario, wherein the feature extractor operates over the full frequency range. In this paper, we hypothesize that a machine can learn enough to perform the classification task when listening to only part of the frequency range instead of the full range, a technique we call frequency selection, and we propose a novel multi-stream Convolutional Neural Network (CNN) framework with this technique for speaker verification. The proposed framework accommodates diverse temporal embeddings generated from multiple streams to enhance the robustness of acoustic modeling. To diversify the temporal embeddings, we perform feature augmentation with frequency selection: the full frequency band is manually segmented into several sub-bands, and the feature extractor of each stream selects which sub-bands to use as its target frequency domain. Unlike the conventional single-stream solution, in which each utterance is processed only once, our framework processes each utterance in multiple streams in parallel. The input utterance for each stream is pre-processed by a frequency selector within a specified frequency range and post-processed by mean normalization. The normalized temporal embeddings of each stream then flow into a pooling layer to generate fused embeddings. We conduct extensive experiments on the VoxCeleb dataset, and the results demonstrate that the multi-stream CNN significantly outperforms the single-stream baseline, with a 20.53% relative improvement in minimum Decision Cost Function (minDCF).
- Asia > Japan > Honshū > Kantō > Tokyo Metropolis Prefecture > Tokyo (0.14)
- Asia > China > Zhejiang Province > Hangzhou (0.04)
- Asia > China > Shanghai > Shanghai (0.04)
- (19 more...)
- Energy (0.93)
- Information Technology > Security & Privacy (0.46)
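A minimal sketch of the multi-stream frequency-selection idea, assuming an 80-bin mel spectrogram split into sub-bands, one small CNN stream per band, batch mean-and-variance normalization as a stand-in for the paper's mean normalization, and average pooling for fusion:

```python
# Sketch: multiple CNN streams, each seeing only a selected frequency band.
import torch
import torch.nn as nn

class Stream(nn.Module):
    def __init__(self, n_bins, emb=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(n_bins, 128, 3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool1d(1), nn.Flatten(), nn.Linear(128, emb),
        )

    def forward(self, x):
        e = self.net(x)
        return (e - e.mean(dim=0)) / (e.std(dim=0) + 1e-5)   # normalization

class MultiStreamCNN(nn.Module):
    def __init__(self, bands=((0, 40), (40, 80), (0, 80)), emb=64):
        super().__init__()
        self.bands = bands                                    # selected sub-bands
        self.streams = nn.ModuleList(Stream(hi - lo, emb) for lo, hi in bands)

    def forward(self, mel):                                   # mel: (batch, 80, frames)
        embs = [s(mel[:, lo:hi]) for s, (lo, hi) in zip(self.streams, self.bands)]
        return torch.stack(embs).mean(dim=0)                  # pooled, fused embedding

model = MultiStreamCNN()
print(model(torch.randn(8, 80, 200)).shape)                   # torch.Size([8, 64])
```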
Interpolating Speaker Identities in Embedding Space for Data Expansion
Liu, Tianchi, Tao, Ruijie, Wang, Qiongqiong, Jiang, Yidi, Sailor, Hardik B., Zhang, Ke, Lin, Jingru, Li, Haizhou
The success of deep learning-based speaker verification systems is largely attributed to access to large-scale and diverse speaker identity data. However, collecting data from more identities is expensive, challenging, and often limited by privacy concerns. To address this limitation, we propose INSIDE (Interpolating Speaker Identities in Embedding Space), a novel data expansion method that synthesizes new speaker identities by interpolating between existing speaker embeddings. Specifically, we select pairs of nearby speaker embeddings from a pretrained speaker embedding space and compute intermediate embeddings using spherical linear interpolation. These interpolated embeddings are then fed to a text-to-speech system to generate corresponding speech waveforms. The resulting data is combined with the original dataset to train downstream models. Experiments show that models trained with INSIDE-expanded data outperform those trained only on real data, achieving 3.06% to 5.24% relative improvements. While INSIDE is primarily designed for speaker verification, we also validate its effectiveness on gender classification, where it yields a 13.44% relative improvement. Moreover, INSIDE is compatible with other augmentation techniques and can serve as a flexible, scalable addition to existing training pipelines.
- North America > United States (0.14)
- Asia > China > Guangdong Province > Shenzhen (0.05)
- Asia > Singapore > Central Region > Singapore (0.04)
- (2 more...)
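The spherical linear interpolation (slerp) step at the core of INSIDE is standard and easy to sketch; the 192-dimensional embeddings below are an assumption, and the TTS synthesis stage is out of scope here:

```python
# Slerp between two speaker embeddings to synthesize an intermediate identity.
import numpy as np

def slerp(a, b, t):
    """Interpolate between speaker embeddings a and b at ratio t in [0, 1]."""
    a = a / np.linalg.norm(a)
    b = b / np.linalg.norm(b)
    omega = np.arccos(np.clip(np.dot(a, b), -1.0, 1.0))  # angle between embeddings
    if omega < 1e-6:                                     # nearly identical speakers
        return a
    return (np.sin((1 - t) * omega) * a + np.sin(t * omega) * b) / np.sin(omega)

spk_a, spk_b = np.random.randn(192), np.random.randn(192)
new_speaker = slerp(spk_a, spk_b, t=0.5)   # a synthetic "intermediate" identity
```

Unlike linear interpolation, slerp keeps the intermediate embedding on the unit hypersphere, which is why it is the natural choice when embeddings are compared by cosine similarity.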
SSPS: Self-Supervised Positive Sampling for Robust Self-Supervised Speaker Verification
Self-Supervised Learning (SSL) has led to considerable progress in Speaker Verification (SV). The standard framework uses same-utterance positive sampling and data-augmentation to generate anchor-positive pairs of the same speaker. This is a major limitation, as this strategy primarily encodes channel information from the recording condition, shared by the anchor and positive. We propose a new positive sampling technique to address this bottleneck: Self-Supervised Positive Sampling (SSPS). For a given anchor, SSPS aims to find an appropriate positive, i.e., of the same speaker identity but a different recording condition, in the latent space using clustering assignments and a memory queue of positive embeddings. SSPS improves SV performance for both SimCLR and DINO, reaching 2.57% and 2.53% EER, outperforming SOTA SSL methods on VoxCeleb1-O. In particular, SimCLR-SSPS achieves a 58% EER reduction by lowering intra-speaker variance, providing comparable performance to DINO-SSPS.
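A simplified sketch of SSPS-style positive sampling: for each anchor, pick a stored embedding from the same cluster but a different utterance, using cluster assignments and a memory queue. The precomputed centroids and queue layout below are assumptions; the actual method's clustering and queue maintenance are more involved:

```python
# Sketch: sample a positive of the same pseudo-speaker (cluster) but a
# different recording, from a memory queue of embeddings.
import numpy as np

rng = np.random.default_rng(0)
queue_embs = rng.standard_normal((1000, 192))        # memory queue of embeddings
queue_utt = rng.integers(0, 400, size=1000)          # source utterance ids

def assign(embs, centroids):
    """Nearest-centroid cluster assignment (one k-means step, for brevity)."""
    d = np.linalg.norm(embs[:, None] - centroids[None], axis=-1)
    return d.argmin(axis=1)

centroids = queue_embs[rng.choice(len(queue_embs), 50, replace=False)]
queue_clusters = assign(queue_embs, centroids)

def sample_positive(anchor_emb, anchor_utt):
    cluster = assign(anchor_emb[None], centroids)[0]
    # Same (pseudo-)speaker cluster, but a different recording condition.
    pool = np.where((queue_clusters == cluster) & (queue_utt != anchor_utt))[0]
    return queue_embs[rng.choice(pool)] if len(pool) else anchor_emb

print(sample_positive(rng.standard_normal(192), anchor_utt=3).shape)  # (192,)
```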