Speaker Identification
Modular Deep Learning Framework for Assistive Perception: Gaze, Affect, and Speaker Identification
Anchan, Akshit Pramod, Thomas, Jewelith, Roy, Sritama
Developing comprehensive assistive technologies requires the seamless integration of visual and auditory perception. This research evaluates the feasibility of a modular architecture inspired by core functionalities of perceptive systems like 'Smart Eye.' We propose and benchmark three independent sensing modules: a Convolutional Neural Network (CNN) for eye state detection (drowsiness/attention), a deep CNN for facial expression recognition, and a Long Short-Term Memory (LSTM) network for voice-based speaker identification. Utilizing the Eyes Image, FER2013, and customized audio datasets, our models achieved accuracies of 93.0%, 97.8%, and 96.89%, respectively. This study demonstrates that lightweight, domain-specific models can achieve high fidelity on discrete tasks, establishing a validated foundation for future real-time, multimodal integration in resource-constrained assistive devices.
- Asia > India > Tamil Nadu > Chennai (0.05)
- North America > United States (0.05)
- Europe > Czechia > Moravian-Silesian Region > Ostrava (0.04)
- North America > United States (0.14)
- Europe > Italy (0.14)
- Asia (0.14)
- Government (0.46)
- Media (0.46)
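To make the paper's speaker-identification module concrete, here is a minimal sketch of an LSTM classifier over per-frame audio features, in the spirit of the abstract above. The input shape (100 frames of 13 MFCCs), layer sizes, and speaker count are illustrative assumptions, not the authors' published configuration.

```python
# Minimal sketch of an LSTM speaker-identification module like the one the
# abstract describes. Shapes and hyperparameters are assumptions.
import numpy as np
import tensorflow as tf
from tensorflow.keras import layers, models

NUM_SPEAKERS = 10         # hypothetical number of enrolled speakers
FRAMES, N_MFCC = 100, 13  # hypothetical: 100 frames of 13 MFCCs per utterance

model = models.Sequential([
    layers.Input(shape=(FRAMES, N_MFCC)),
    layers.LSTM(128, return_sequences=True),
    layers.LSTM(64),
    layers.Dense(64, activation="relu"),
    layers.Dense(NUM_SPEAKERS, activation="softmax"),
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])

# Dummy batch standing in for MFCC features extracted from audio clips.
x = np.random.randn(32, FRAMES, N_MFCC).astype("float32")
y = np.random.randint(0, NUM_SPEAKERS, size=32)
model.fit(x, y, epochs=1, verbose=0)
```

The two CNN modules (eye state, facial expression) would follow the same pattern with image inputs and convolutional layers in place of the LSTMs.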
Multi-Target Backdoor Attacks Against Speaker Recognition
Fortier, Alexandrine, Joshi, Sonal, Thebaud, Thomas, Villalba, Jesús, Dehak, Najim, Cardinal, Patrick
In this work, we propose a multi-target backdoor attack against speaker identification using position-independent clicking sounds as triggers. To simulate more realistic attack conditions, we vary the signal-to-noise ratio between speech and trigger, demonstrating a trade-off between stealth and effectiveness. We further extend the attack to the speaker verification task by selecting the most similar training speaker, based on cosine similarity, as a proxy target. The attack is most effective when target and enrolled speaker pairs are highly similar, reaching success rates of up to 90% in such cases. In recent years, speaker recognition systems have achieved strong performance. However, they remain susceptible to significant security risks, including malicious attacks [1]-[6]. Due to constraints in data and computational resources, many organizations rely on external providers for model development or data collection. A particularly concerning threat is backdoor attacks, which are introduced during training. The backdoor itself is a hidden mechanism the model learns during training: when a specific input pattern, known as a trigger, is present, the model consistently produces a target output, regardless of the true input.
- Information Technology > Security & Privacy (1.00)
- Information Technology > Artificial Intelligence > Natural Language (1.00)
- Information Technology > Artificial Intelligence > Speech > Acoustic Processing (0.73)
- Information Technology > Artificial Intelligence > Machine Learning > Pattern Recognition > Speech Recognition (0.62)
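To make the poisoning mechanics concrete, the sketch below mixes a short click trigger into a clean utterance at a chosen signal-to-noise ratio and relabels the sample with the attacker's target speaker, mirroring the position-independent trigger and SNR trade-off described in the abstract above. The function name, trigger length, and random placement are illustrative assumptions, not the authors' code.

```python
# Illustrative sketch of a click-trigger poisoning step: scale the trigger
# so the speech-to-trigger power ratio equals `snr_db`, insert it at a
# random offset, and relabel the sample with the attack target.
import numpy as np

def poison(speech: np.ndarray, click: np.ndarray, snr_db: float,
           target_label: int, rng: np.random.Generator):
    speech_power = np.mean(speech ** 2)
    click_power = np.mean(click ** 2)
    # SNR(dB) = 10 * log10(P_speech / P_trigger) fixes the trigger scale.
    scale = np.sqrt(speech_power / (click_power * 10 ** (snr_db / 10)))
    start = rng.integers(0, len(speech) - len(click))  # position-independent
    poisoned = speech.copy()
    poisoned[start:start + len(click)] += scale * click
    return poisoned, target_label  # relabel to the attack target

rng = np.random.default_rng(0)
speech = rng.standard_normal(16000)  # 1 s of placeholder speech at 16 kHz
click = rng.standard_normal(160)     # 10 ms click trigger
x_poisoned, y = poison(speech, click, snr_db=20.0, target_label=3, rng=rng)
```

A higher `snr_db` makes the trigger quieter and stealthier but, per the abstract, less effective; the verification-task extension would additionally pick `target_label` as the training speaker whose embedding is most cosine-similar to the enrolled speaker.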
From Dialect Gaps to Identity Maps: Tackling Variability in Speaker Verification
Abdullah, Abdulhady Abas, Badawi, Soran, Abdullah, Dana A., Hamad, Dana Rasul
The complexity and difficulty of Kurdish speaker identification across its several dialects are investigated in this work. Because of large phonetic and lexical differences, Kurdish, whose dialects include Kurmanji, Sorani, and Hawrami, poses particular challenges for speaker recognition systems. This work examines the main difficulties in building a robust speaker identification system capable of accurately identifying speakers across dialects, and, to raise the accuracy and dependability of such systems, proposes remedies including advanced machine learning approaches, data augmentation tactics, and the construction of comprehensive dialect-specific corpora. The results show that strategies customized to each dialect, together with cross-dialect training, greatly enhance recognition performance.
- Asia > Middle East > Republic of Türkiye (0.05)
- Asia > Middle East > Syria (0.05)
- Asia > Middle East > Iraq > Erbil Governorate > Erbil (0.04)
- Government (0.68)
- Information Technology (0.67)
- Information Technology > Artificial Intelligence > Speech > Speech Recognition (1.00)
- Information Technology > Artificial Intelligence > Speech > Acoustic Processing (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
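The abstract above does not specify its augmentation tactics; below is a minimal, generic sketch of two waveform-level augmentations (noise addition and speed perturbation) that are commonly used to harden speaker-identification models against dialectal and channel variation. All parameter values are assumptions.

```python
# Two generic waveform augmentations; parameters are illustrative only.
import numpy as np

def add_noise(wav: np.ndarray, snr_db: float,
              rng: np.random.Generator) -> np.ndarray:
    # Add white noise scaled to the requested signal-to-noise ratio.
    noise = rng.standard_normal(len(wav))
    scale = np.sqrt(np.mean(wav**2) / (np.mean(noise**2) * 10**(snr_db / 10)))
    return wav + scale * noise

def speed_perturb(wav: np.ndarray, rate: float) -> np.ndarray:
    # Resample by linear interpolation; rate > 1 speeds the clip up.
    idx = np.arange(0, len(wav), rate)
    return np.interp(idx, np.arange(len(wav)), wav)

rng = np.random.default_rng(0)
utterance = rng.standard_normal(16000)  # placeholder 1 s clip at 16 kHz
augmented = [add_noise(utterance, snr_db, rng) for snr_db in (5, 10, 20)]
augmented += [speed_perturb(utterance, r) for r in (0.9, 1.1)]
```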
Text-Independent Speaker Identification Using Audio Looping With Margin Based Loss Functions
Garcia, Elliot Q C, Vilela, Nicéias Silva, Sacramento, Kátia Pires Nascimento do, Ferreira, Tiago A. E.
Speaker identification has become a crucial component in various applications, including security systems, virtual assistants, and personalized user experiences. In this paper, we investigate the effectiveness of CosFace Loss and ArcFace Loss for text-independent speaker identification using a Convolutional Neural Network architecture based on the VGG16 model, modified to accommodate mel spectrogram inputs of variable sizes generated from the VoxCeleb1 dataset. Our approach implements both loss functions to analyze their effects on model accuracy and robustness, with the Softmax loss function employed as a comparative baseline. Additionally, we examine how the sizes of mel spectrograms and their varying time lengths influence model performance. The experimental results demonstrate superior identification accuracy compared to traditional Softmax loss methods. Furthermore, we discuss the implications of these findings for future research.
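The margin-based losses compared in this paper have standard published formulations; the sketch below computes ArcFace-style (additive angular margin) and CosFace-style (additive cosine margin) logits from L2-normalized embeddings. The scale and margin values follow common defaults and are not taken from the paper.

```python
# Standard ArcFace/CosFace logit computation from normalized embeddings.
# Scale `s` and margin `m` are common defaults, assumed for illustration.
import numpy as np

def margin_logits(embeddings, weights, labels, mode="arcface",
                  s=30.0, m=0.35):
    """embeddings: (batch, dim); weights: (dim, classes); labels: (batch,)."""
    e = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    w = weights / np.linalg.norm(weights, axis=0, keepdims=True)
    cos = e @ w                                    # cosine similarities
    target = cos[np.arange(len(labels)), labels]
    if mode == "arcface":                          # cos(theta + m)
        penalized = np.cos(np.arccos(np.clip(target, -1.0, 1.0)) + m)
    else:                                          # cosface: cos(theta) - m
        penalized = target - m
    logits = cos.copy()
    logits[np.arange(len(labels)), labels] = penalized
    return s * logits   # fed into softmax cross-entropy during training

rng = np.random.default_rng(0)
logits = margin_logits(rng.standard_normal((4, 128)),
                       rng.standard_normal((128, 10)),
                       labels=np.array([1, 3, 5, 7]))
```

Both margins penalize the target-class logit only, forcing embeddings of the same speaker to cluster more tightly than plain Softmax training does.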
Speech Unlearning
We introduce machine unlearning for speech tasks, a novel and underexplored research problem that aims to efficiently and effectively remove the influence of specific data from trained speech models without full retraining. This has important applications in privacy preservation, removal of outdated or noisy data, and bias mitigation. While machine unlearning has been studied in computer vision and natural language processing, its application to speech is largely unexplored due to the high-dimensional, sequential, and speaker-dependent nature of speech data. We define two fundamental speech unlearning tasks: sample unlearning, which removes individual data points (e.g., a voice recording), and class unlearning, which removes an entire category (e.g., all data from a speaker), while preserving performance on the remaining data. Experiments on keyword spotting and speaker identification demonstrate that unlearning speech data is significantly more challenging than unlearning image or text data. We conclude with key future directions in this area, including structured training, robust evaluation, feature-level unlearning, broader applications, scalable methods, and adversarial robustness.
- North America > United States > Massachusetts > Middlesex County > Lowell (0.40)
- Europe > Switzerland (0.04)
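As a concrete illustration of the two task definitions in the abstract above, the sketch below expresses sample and class unlearning as forget/retain splits over a hypothetical dataset of (features, speaker_id) pairs; the unlearning algorithm itself is left abstract.

```python
# The two unlearning setups as dataset splits; `dataset` is a hypothetical
# list of (features, speaker_id) pairs.

def sample_unlearning_split(dataset, forget_indices):
    # Remove individual data points, e.g. one voice recording.
    forget_set = set(forget_indices)
    forget = [dataset[i] for i in forget_indices]
    retain = [x for i, x in enumerate(dataset) if i not in forget_set]
    return forget, retain

def class_unlearning_split(dataset, forget_speaker):
    # Remove an entire category, e.g. all data from one speaker.
    forget = [(x, y) for x, y in dataset if y == forget_speaker]
    retain = [(x, y) for x, y in dataset if y != forget_speaker]
    return forget, retain

# Evaluation contract: after unlearning, accuracy on `forget` should drop
# toward chance while accuracy on `retain` stays close to the original model's.
data = [([0.1, 0.2], 0), ([0.3, 0.1], 1), ([0.2, 0.2], 1)]  # toy dataset
print(class_unlearning_split(data, forget_speaker=1))
```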
Whisper Speaker Identification: Leveraging Pre-Trained Multilingual Transformers for Robust Speaker Embeddings
Emon, Jakaria Islam, Salek, Md Abu, Alam, Kazi Tamanna
Speaker identification in multilingual settings presents unique challenges, particularly when conventional models are predominantly trained on English data. In this paper, we propose WSI (Whisper Speaker Identification), a framework that repurposes the encoder of the Whisper automatic speech recognition model, pre-trained on extensive multilingual data, to generate robust speaker embeddings via a joint loss optimization strategy that leverages online hard triplet mining and a self-supervised Normalized Temperature-scaled Cross-Entropy (NT-Xent) loss. By capitalizing on Whisper's language-agnostic acoustic representations, our approach effectively distinguishes speakers across diverse languages and recording conditions. Extensive evaluations on multiple corpora, including VoxTube (multilingual), JVS (Japanese), CallHome (German, Spanish, Chinese, and Japanese), and VoxConverse (English), demonstrate that WSI consistently outperforms state-of-the-art baselines, namely Pyannote Embedding, ECAPA-TDNN, and x-vector, in terms of lower equal error rates and higher AUC scores. These results validate our hypothesis that a multilingual pre-trained ASR encoder, combined with joint loss optimization, substantially improves speaker identification performance in non-English languages.
- Asia > Japan > Hokkaidō > Hokkaidō Prefecture > Sapporo (0.04)
- Asia > Bangladesh (0.04)
- Information Technology > Artificial Intelligence > Speech > Acoustic Processing (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Performance Analysis > Accuracy (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.94)
- Information Technology > Artificial Intelligence > Speech > Speech Recognition (0.90)
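A minimal sketch of the central idea in the abstract above: reusing Whisper's encoder as a speaker-embedding extractor via the Hugging Face transformers API. The checkpoint choice and mean-pooling are illustrative assumptions; the joint triplet / NT-Xent training that WSI adds on top is not reproduced here.

```python
# Extract a speaker embedding from Whisper's encoder (Hugging Face API).
# Checkpoint and pooling strategy are illustrative assumptions.
import numpy as np
import torch
from transformers import WhisperFeatureExtractor, WhisperModel

extractor = WhisperFeatureExtractor.from_pretrained("openai/whisper-base")
model = WhisperModel.from_pretrained("openai/whisper-base").eval()

audio = np.random.randn(16000).astype(np.float32)  # placeholder 1 s clip
inputs = extractor(audio, sampling_rate=16000, return_tensors="pt")

with torch.no_grad():
    hidden = model.encoder(inputs.input_features).last_hidden_state
    # Mean-pool over time and L2-normalize to get a fixed-size embedding.
    embedding = torch.nn.functional.normalize(hidden.mean(dim=1), dim=-1)

print(embedding.shape)  # (1, hidden_size)
```

Speaker similarity is then a cosine similarity between such embeddings; WSI-style training would refine them with the joint triplet and NT-Xent objective described in the abstract.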
Friends-MMC: A Dataset for Multi-modal Multi-party Conversation Understanding
Wang, Yueqian, Meng, Xiaojun, Wang, Yuxuan, Liang, Jianxin, Liu, Qun, Zhao, Dongyan
Multi-modal multi-party conversation (MMC) is a less-studied yet important research topic because it closely matches real-world scenarios and thus has potentially wide application. Compared with traditional multi-modal conversations, MMC requires stronger character-centered understanding abilities, as many interlocutors appear in both the visual and textual context. To facilitate the study of this problem, we present Friends-MMC, an MMC dataset that contains more than 24,000 unique utterances paired with video context. To explore character-centered understanding of the dialogue, we also annotate the speaker of each utterance and the names and bounding boxes of the faces that appear in the video. Based on the Friends-MMC dataset, we further study two fundamental MMC tasks: conversation speaker identification and conversation response prediction, both of which have a multi-party nature with video or images as visual context. For conversation speaker identification, we demonstrate the shortcomings of existing methods such as pre-trained models, and propose a simple yet effective baseline that uses an optimization solver to combine the context of the two modalities for better performance. For conversation response prediction, we fine-tune generative dialogue models on Friends-MMC and analyze the benefits of speaker information. The code and dataset are publicly available at https://github.com/yellow-binary-tree/Friends-MMC; we therefore call for more attention to modeling speaker information when understanding conversations.
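The abstract does not detail its solver-based baseline; one plausible reading, sketched below under stated assumptions, treats speaker identification as an assignment problem over per-utterance face and dialogue affinity scores, solved with the Hungarian algorithm. The random score matrices, equal modality weighting, and one-to-one matching are illustrative assumptions, not the paper's exact formulation.

```python
# Hedged sketch of an assignment-style baseline for conversation speaker
# identification: combine visual and textual affinity scores and solve for
# a globally consistent utterance-to-face assignment.
import numpy as np
from scipy.optimize import linear_sum_assignment

rng = np.random.default_rng(0)
n_utterances, n_faces = 5, 5
visual_scores = rng.random((n_utterances, n_faces))   # face-utterance affinity
textual_scores = rng.random((n_utterances, n_faces))  # dialogue-based affinity

combined = 0.5 * visual_scores + 0.5 * textual_scores  # assumed equal weighting
rows, cols = linear_sum_assignment(-combined)          # maximize total score
for utt, face in zip(rows, cols):
    print(f"utterance {utt} -> face {face}")
```

The point of the solver is that utterance-level predictions are made jointly rather than independently, so a confident visual match for one utterance can disambiguate a weak textual match for another.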