Speaker Identification
Modular Deep Learning Framework for Assistive Perception: Gaze, Affect, and Speaker Identification
Anchan, Akshit Pramod, Thomas, Jewelith, Roy, Sritama
Developing comprehensive assistive technologies requires the seamless integration of visual and auditory perception. This research evaluates the feasibility of a modular architecture inspired by core functionalities of perceptive systems like 'Smart Eye.' We propose and benchmark three independent sensing modules: a Convolutional Neural Network (CNN) for eye state detection (drowsiness/attention), a deep CNN for facial expression recognition, and a Long Short-Term Memory (LSTM) network for voice-based speaker identification. Utilizing the Eyes Image, FER2013, and customized audio datasets, our models achieved accuracies of 93.0%, 97.8%, and 96.89%, respectively. This study demonstrates that lightweight, domain-specific models can achieve high fidelity on discrete tasks, establishing a validated foundation for future real-time, multimodal integration in resource-constrained assistive devices.
- Asia > India > Tamil Nadu > Chennai (0.05)
- North America > United States (0.05)
- Europe > Czechia > Moravian-Silesian Region > Ostrava (0.04)
- North America > United States (0.14)
- Europe > Italy (0.14)
- Asia (0.14)
- Government (0.46)
- Media (0.46)
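To make the paper's speaker-identification module concrete, here is a minimal sketch of an LSTM classifier over per-frame audio features, in the spirit of the abstract above. The input shape (100 frames of 13 MFCCs), layer sizes, and speaker count are illustrative assumptions, not the authors' published configuration.

```python
# Minimal sketch of an LSTM speaker-identification module like the one the
# abstract describes. Shapes and hyperparameters are assumptions.
import numpy as np
import tensorflow as tf
from tensorflow.keras import layers, models

NUM_SPEAKERS = 10         # hypothetical number of enrolled speakers
FRAMES, N_MFCC = 100, 13  # hypothetical: 100 frames of 13 MFCCs per utterance

model = models.Sequential([
    layers.Input(shape=(FRAMES, N_MFCC)),
    layers.LSTM(128, return_sequences=True),
    layers.LSTM(64),
    layers.Dense(64, activation="relu"),
    layers.Dense(NUM_SPEAKERS, activation="softmax"),
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])

# Dummy batch standing in for MFCC features extracted from audio clips.
x = np.random.randn(32, FRAMES, N_MFCC).astype("float32")
y = np.random.randint(0, NUM_SPEAKERS, size=32)
model.fit(x, y, epochs=1, verbose=0)
```

The two CNN modules (eye state, facial expression) would follow the same pattern with image inputs and convolutional layers in place of the LSTMs.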
Multi-Target Backdoor Attacks Against Speaker Recognition
Fortier, Alexandrine, Joshi, Sonal, Thebaud, Thomas, Villalba, Jesús, Dehak, Najim, Cardinal, Patrick
In this work, we propose a multi-target backdoor attack against speaker identification using position-independent clicking sounds as triggers. To simulate more realistic attack conditions, we vary the signal-to-noise ratio between speech and trigger, demonstrating a trade-off between stealth and effectiveness. We further extend the attack to the speaker verification task by selecting the most similar training speaker, based on cosine similarity, as a proxy target. The attack is most effective when target and enrolled speaker pairs are highly similar, reaching success rates of up to 90% in such cases. In recent years, speaker recognition systems have achieved strong performance. However, they remain susceptible to significant security risks, including malicious attacks [1]-[6]. Due to constraints in data and computational resources, many organizations rely on external providers for model development or data collection. A particularly concerning threat is backdoor attacks, which are introduced during training. The backdoor itself is a hidden mechanism the model learns during training: when a specific input pattern, known as a trigger, is present, the model consistently produces a target output, regardless of the true input.
- Information Technology > Security & Privacy (1.00)
- Information Technology > Artificial Intelligence > Natural Language (1.00)
- Information Technology > Artificial Intelligence > Speech > Acoustic Processing (0.73)
- Information Technology > Artificial Intelligence > Machine Learning > Pattern Recognition > Speech Recognition (0.62)
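To make the poisoning mechanics concrete, the sketch below mixes a short click trigger into a clean utterance at a chosen signal-to-noise ratio and relabels the sample with the attacker's target speaker, mirroring the position-independent trigger and SNR trade-off described in the abstract above. The function name, trigger length, and random placement are illustrative assumptions, not the authors' code.

```python
# Illustrative sketch of a click-trigger poisoning step: scale the trigger
# so the speech-to-trigger power ratio equals `snr_db`, insert it at a
# random offset, and relabel the sample with the attack target.
import numpy as np

def poison(speech: np.ndarray, click: np.ndarray, snr_db: float,
           target_label: int, rng: np.random.Generator):
    speech_power = np.mean(speech ** 2)
    click_power = np.mean(click ** 2)
    # SNR(dB) = 10 * log10(P_speech / P_trigger) fixes the trigger scale.
    scale = np.sqrt(speech_power / (click_power * 10 ** (snr_db / 10)))
    start = rng.integers(0, len(speech) - len(click))  # position-independent
    poisoned = speech.copy()
    poisoned[start:start + len(click)] += scale * click
    return poisoned, target_label  # relabel to the attack target

rng = np.random.default_rng(0)
speech = rng.standard_normal(16000)  # 1 s of placeholder speech at 16 kHz
click = rng.standard_normal(160)     # 10 ms click trigger
x_poisoned, y = poison(speech, click, snr_db=20.0, target_label=3, rng=rng)
```

A higher `snr_db` makes the trigger quieter and stealthier but, per the abstract, less effective; the verification-task extension would additionally pick `target_label` as the training speaker whose embedding is most cosine-similar to the enrolled speaker.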
From Dialect Gaps to Identity Maps: Tackling Variability in Speaker Verification
Abdullah, Abdulhady Abas, Badawi, Soran, Abdullah, Dana A., Hamad, Dana Rasul
The complexity and difficulty of Kurdish speaker identification across its several dialects are investigated in this work. Because of large phonetic and lexical differences, Kurdish, whose dialects include Kurmanji, Sorani, and Hawrami, poses particular challenges for speaker recognition systems. This work examines the main difficulties in building a robust speaker identification system capable of accurately identifying speakers across dialects, and, to raise the accuracy and dependability of such systems, proposes remedies including advanced machine learning approaches, data augmentation tactics, and the construction of comprehensive dialect-specific corpora. The results show that strategies customized to each dialect, together with cross-dialect training, greatly enhance recognition performance.
- Asia > Middle East > Republic of Türkiye (0.05)
- Asia > Middle East > Syria (0.05)
- Asia > Middle East > Iraq > Erbil Governorate > Erbil (0.04)
- Government (0.68)
- Information Technology (0.67)
- Information Technology > Artificial Intelligence > Speech > Speech Recognition (1.00)
- Information Technology > Artificial Intelligence > Speech > Acoustic Processing (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
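The abstract above does not specify its augmentation tactics; below is a minimal, generic sketch of two waveform-level augmentations (noise addition and speed perturbation) that are commonly used to harden speaker-identification models against dialectal and channel variation. All parameter values are assumptions.

```python
# Two generic waveform augmentations; parameters are illustrative only.
import numpy as np

def add_noise(wav: np.ndarray, snr_db: float,
              rng: np.random.Generator) -> np.ndarray:
    # Add white noise scaled to the requested signal-to-noise ratio.
    noise = rng.standard_normal(len(wav))
    scale = np.sqrt(np.mean(wav**2) / (np.mean(noise**2) * 10**(snr_db / 10)))
    return wav + scale * noise

def speed_perturb(wav: np.ndarray, rate: float) -> np.ndarray:
    # Resample by linear interpolation; rate > 1 speeds the clip up.
    idx = np.arange(0, len(wav), rate)
    return np.interp(idx, np.arange(len(wav)), wav)

rng = np.random.default_rng(0)
utterance = rng.standard_normal(16000)  # placeholder 1 s clip at 16 kHz
augmented = [add_noise(utterance, snr_db, rng) for snr_db in (5, 10, 20)]
augmented += [speed_perturb(utterance, r) for r in (0.9, 1.1)]
```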
Text-Independent Speaker Identification Using Audio Looping With Margin Based Loss Functions
Garcia, Elliot Q C, Vilela, Nicéias Silva, Sacramento, Kátia Pires Nascimento do, Ferreira, Tiago A. E.
Speaker identification has become a crucial component in various applications, including security systems, virtual assistants, and personalized user experiences. In this paper, we investigate the effectiveness of CosFace Loss and ArcFace Loss for text-independent speaker identification using a Convolutional Neural Network architecture based on the VGG16 model, modified to accommodate mel spectrogram inputs of variable sizes generated from the VoxCeleb1 dataset. Our approach implements both loss functions to analyze their effects on model accuracy and robustness, with the Softmax loss function employed as a comparative baseline. Additionally, we examine how the sizes of mel spectrograms and their varying time lengths influence model performance. The experimental results demonstrate superior identification accuracy compared to traditional Softmax loss methods. Furthermore, we discuss the implications of these findings for future research.
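The margin-based losses compared in this paper have standard published formulations; the sketch below computes ArcFace-style (additive angular margin) and CosFace-style (additive cosine margin) logits from L2-normalized embeddings. The scale and margin values follow common defaults and are not taken from the paper.

```python
# Standard ArcFace/CosFace logit computation from normalized embeddings.
# Scale `s` and margin `m` are common defaults, assumed for illustration.
import numpy as np

def margin_logits(embeddings, weights, labels, mode="arcface",
                  s=30.0, m=0.35):
    """embeddings: (batch, dim); weights: (dim, classes); labels: (batch,)."""
    e = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    w = weights / np.linalg.norm(weights, axis=0, keepdims=True)
    cos = e @ w                                    # cosine similarities
    target = cos[np.arange(len(labels)), labels]
    if mode == "arcface":                          # cos(theta + m)
        penalized = np.cos(np.arccos(np.clip(target, -1.0, 1.0)) + m)
    else:                                          # cosface: cos(theta) - m
        penalized = target - m
    logits = cos.copy()
    logits[np.arange(len(labels)), labels] = penalized
    return s * logits   # fed into softmax cross-entropy during training

rng = np.random.default_rng(0)
logits = margin_logits(rng.standard_normal((4, 128)),
                       rng.standard_normal((128, 10)),
                       labels=np.array([1, 3, 5, 7]))
```

Both margins penalize the target-class logit only, forcing embeddings of the same speaker to cluster more tightly than plain Softmax training does.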
Speech Unlearning
We introduce machine unlearning for speech tasks, a novel and underexplored research problem that aims to efficiently and effectively remove the influence of specific data from trained speech models without full retraining. This has important applications in privacy preservation, removal of outdated or noisy data, and bias mitigation. While machine unlearning has been studied in computer vision and natural language processing, its application to speech is largely unexplored due to the high-dimensional, sequential, and speaker-dependent nature of speech data. We define two fundamental speech unlearning tasks: sample unlearning, which removes individual data points (e.g., a voice recording), and class unlearning, which removes an entire category (e.g., all data from a speaker), while preserving performance on the remaining data. Experiments on keyword spotting and speaker identification demonstrate that unlearning speech data is significantly more challenging than unlearning image or text data. We conclude with key future directions in this area, including structured training, robust evaluation, feature-level unlearning, broader applications, scalable methods, and adversarial robustness.
- North America > United States > Massachusetts > Middlesex County > Lowell (0.40)
- Europe > Switzerland (0.04)
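As a concrete illustration of the two task definitions in the abstract above, the sketch below expresses sample and class unlearning as forget/retain splits over a hypothetical dataset of (features, speaker_id) pairs; the unlearning algorithm itself is left abstract.

```python
# The two unlearning setups as dataset splits; `dataset` is a hypothetical
# list of (features, speaker_id) pairs.

def sample_unlearning_split(dataset, forget_indices):
    # Remove individual data points, e.g. one voice recording.
    forget_set = set(forget_indices)
    forget = [dataset[i] for i in forget_indices]
    retain = [x for i, x in enumerate(dataset) if i not in forget_set]
    return forget, retain

def class_unlearning_split(dataset, forget_speaker):
    # Remove an entire category, e.g. all data from one speaker.
    forget = [(x, y) for x, y in dataset if y == forget_speaker]
    retain = [(x, y) for x, y in dataset if y != forget_speaker]
    return forget, retain

# Evaluation contract: after unlearning, accuracy on `forget` should drop
# toward chance while accuracy on `retain` stays close to the original model's.
data = [([0.1, 0.2], 0), ([0.3, 0.1], 1), ([0.2, 0.2], 1)]  # toy dataset
print(class_unlearning_split(data, forget_speaker=1))
```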
Whisper Speaker Identification: Leveraging Pre-Trained Multilingual Transformers for Robust Speaker Embeddings
Emon, Jakaria Islam, Salek, Md Abu, Alam, Kazi Tamanna
Speaker identification in multilingual settings presents unique challenges, particularly when conventional models are predominantly trained on English data. In this paper, we propose WSI (Whisper Speaker Identification), a framework that repurposes the encoder of the Whisper automatic speech recognition model, pre-trained on extensive multilingual data, to generate robust speaker embeddings via a joint loss optimization strategy that leverages online hard triplet mining and a self-supervised Normalized Temperature-scaled Cross-Entropy (NT-Xent) loss. By capitalizing on Whisper's language-agnostic acoustic representations, our approach effectively distinguishes speakers across diverse languages and recording conditions. Extensive evaluations on multiple corpora, including VoxTube (multilingual), JVS (Japanese), CallHome (German, Spanish, Chinese, and Japanese), and VoxConverse (English), demonstrate that WSI consistently outperforms state-of-the-art baselines, namely Pyannote Embedding, ECAPA-TDNN, and x-vector, in terms of lower equal error rates and higher AUC scores. These results validate our hypothesis that a multilingual pre-trained ASR encoder, combined with joint loss optimization, substantially improves speaker identification performance in non-English languages.
- Asia > Japan > Hokkaidō > Hokkaidō Prefecture > Sapporo (0.04)
- Asia > Bangladesh (0.04)
- Information Technology > Artificial Intelligence > Speech > Acoustic Processing (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Performance Analysis > Accuracy (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.94)
- Information Technology > Artificial Intelligence > Speech > Speech Recognition (0.90)
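A minimal sketch of the central idea in the abstract above: reusing Whisper's encoder as a speaker-embedding extractor via the Hugging Face transformers API. The checkpoint choice and mean-pooling are illustrative assumptions; the joint triplet / NT-Xent training that WSI adds on top is not reproduced here.

```python
# Extract a speaker embedding from Whisper's encoder (Hugging Face API).
# Checkpoint and pooling strategy are illustrative assumptions.
import numpy as np
import torch
from transformers import WhisperFeatureExtractor, WhisperModel

extractor = WhisperFeatureExtractor.from_pretrained("openai/whisper-base")
model = WhisperModel.from_pretrained("openai/whisper-base").eval()

audio = np.random.randn(16000).astype(np.float32)  # placeholder 1 s clip
inputs = extractor(audio, sampling_rate=16000, return_tensors="pt")

with torch.no_grad():
    hidden = model.encoder(inputs.input_features).last_hidden_state
    # Mean-pool over time and L2-normalize to get a fixed-size embedding.
    embedding = torch.nn.functional.normalize(hidden.mean(dim=1), dim=-1)

print(embedding.shape)  # (1, hidden_size)
```

Speaker similarity is then a cosine similarity between such embeddings; WSI-style training would refine them with the joint triplet and NT-Xent objective described in the abstract.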
Friends-MMC: A Dataset for Multi-modal Multi-party Conversation Understanding
Wang, Yueqian, Meng, Xiaojun, Wang, Yuxuan, Liang, Jianxin, Liu, Qun, Zhao, Dongyan
Multi-modal multi-party conversation (MMC) is a less-studied yet important research topic because it closely matches real-world scenarios and thus has potentially wide application. Compared with traditional multi-modal conversations, MMC requires stronger character-centered understanding abilities, as many interlocutors appear in both the visual and textual context. To facilitate the study of this problem, we present Friends-MMC, an MMC dataset that contains more than 24,000 unique utterances paired with video context. To explore character-centered understanding of the dialogue, we also annotate the speaker of each utterance and the names and bounding boxes of the faces that appear in the video. Based on the Friends-MMC dataset, we further study two fundamental MMC tasks: conversation speaker identification and conversation response prediction, both of which have a multi-party nature with video or images as visual context. For conversation speaker identification, we demonstrate the shortcomings of existing methods such as pre-trained models, and propose a simple yet effective baseline that uses an optimization solver to combine the context of the two modalities for better performance. For conversation response prediction, we fine-tune generative dialogue models on Friends-MMC and analyze the benefits of speaker information. The code and dataset are publicly available at https://github.com/yellow-binary-tree/Friends-MMC; we therefore call for more attention to modeling speaker information when understanding conversations.
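The abstract does not detail its solver-based baseline; one plausible reading, sketched below under stated assumptions, treats speaker identification as an assignment problem over per-utterance face and dialogue affinity scores, solved with the Hungarian algorithm. The random score matrices, equal modality weighting, and one-to-one matching are illustrative assumptions, not the paper's exact formulation.

```python
# Hedged sketch of an assignment-style baseline for conversation speaker
# identification: combine visual and textual affinity scores and solve for
# a globally consistent utterance-to-face assignment.
import numpy as np
from scipy.optimize import linear_sum_assignment

rng = np.random.default_rng(0)
n_utterances, n_faces = 5, 5
visual_scores = rng.random((n_utterances, n_faces))   # face-utterance affinity
textual_scores = rng.random((n_utterances, n_faces))  # dialogue-based affinity

combined = 0.5 * visual_scores + 0.5 * textual_scores  # assumed equal weighting
rows, cols = linear_sum_assignment(-combined)          # maximize total score
for utt, face in zip(rows, cols):
    print(f"utterance {utt} -> face {face}")
```

The point of the solver is that utterance-level predictions are made jointly rather than independently, so a confident visual match for one utterance can disambiguate a weak textual match for another.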