AITopics | diarisation

Collaborating Authors

diarisation

Information about AI from the News, Publications, and Conferences

Automatic Classification – Tagging and Summarization – Customizable Filtering and Analysis

If you are looking for an answer to the question What is Artificial Intelligence? and you only have a minute, then here's the definition the Association for the Advancement of Artificial Intelligence offers on its home page: "the scientific understanding of the mechanisms underlying thought and intelligent behavior and their embodiment in machines."

However, if you are fortunate enough to have more than a minute, then please get ready to embark upon an exciting journey exploring AI (but beware, it could last a lifetime) …

Speaker Retrieval in the Wild: Challenges, Effectiveness and Robustness

Loweimi, Erfan, Qian, Mengjie, Knill, Kate, Gales, Mark

arXiv.org Artificial IntelligenceApr-30-2025

There is a growing abundance of publicly available or company-owned audio/video archives, highlighting the increasing importance of efficient access to desired content and information retrieval from these archives. This paper investigates the challenges, solutions, effectiveness, and robustness of speaker retrieval systems developed "in the wild" which involves addressing two primary challenges: extraction of task-relevant labels from limited metadata for system development and evaluation, as well as the unconstrained acoustic conditions encountered in the archive, ranging from quiet studios to adverse noisy environments. While we focus on the publicly-available BBC Rewind archive (spanning 1948 to 1979), our framework addresses the broader issue of speaker retrieval on extensive and possibly aged archives with no control over the content and acoustic conditions. Typically, these archives offer a brief and general file description, mostly inadequate for specific applications like speaker retrieval, and manual annotation of such large-scale archives is unfeasible. We explore various aspects of system development (e.g., speaker diarisation, embedding extraction, query selection) and analyse the challenges, possible solutions, and their functionality. To evaluate the performance, we conduct systematic experiments in both clean setup and against various distortions simulating real-world applications. Our findings demonstrate the effectiveness and robustness of the developed speaker retrieval systems, establishing the versatility and scalability of the proposed framework for a wide range of applications beyond the BBC Rewind corpus.

artificial intelligence, machine learning, natural language, (18 more...)

arXiv.org Artificial Intelligence

2504.1895

Country: Europe > United Kingdom > England (0.28)

Genre:

Overview (1.00)
Research Report > New Finding (0.68)

Industry:

Information Technology > Security & Privacy (0.46)
Government (0.46)

Technology:

Information Technology > Artificial Intelligence > Speech > Speech Recognition (1.00)
Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (0.68)

Add feedback

The VoxCeleb Speaker Recognition Challenge: A Retrospective

Huh, Jaesung, Chung, Joon Son, Nagrani, Arsha, Brown, Andrew, Jung, Jee-weon, Garcia-Romero, Daniel, Zisserman, Andrew

arXiv.org Artificial IntelligenceAug-27-2024

The VoxCeleb Speaker Recognition Challenges (VoxSRC) were a series of challenges and workshops that ran annually from 2019 to 2023. The challenges primarily evaluated the tasks of speaker recognition and diarisation under various settings including: closed and open training data; as well as supervised, self-supervised, and semi-supervised training for domain adaptation. The challenges also provided publicly available training and evaluation datasets for each task and setting, with new test sets released each year. In this paper, we provide a review of these challenges that covers: what they explored; the methods developed by the challenge participants and how these evolved; and also the current state of the field for speaker verification and diarisation. We chart the progress in performance over the five installments of the challenge on a common evaluation dataset and provide a detailed analysis of how each year's special focus affected participants' performance. This paper is aimed both at researchers who want an overview of the speaker recognition and diarisation field, and also at challenge organisers who want to benefit from the successes and avoid the mistakes of the VoxSRC challenges. We end with a discussion of the current strengths of the field and open challenges. Project page : https://mm.kaist.ac.kr/datasets/voxceleb/voxsrc/workshop.html

artificial intelligence, machine learning, pattern recognition, (18 more...)

arXiv.org Artificial Intelligence

doi: 10.1109/TASLP.2024.3444456

2408.14886

Country:

Europe > United Kingdom > England > Oxfordshire > Oxford (0.14)
North America > United States > Virginia > Arlington County > Arlington (0.04)
North America > United States > Pennsylvania (0.04)
(11 more...)

Genre:

Overview (1.00)
Research Report > Experimental Study (0.46)

Industry:

Information Technology > Security & Privacy (0.92)
Media (0.67)
Education (0.67)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Pattern Recognition > Speech Recognition (1.00)
Information Technology > Artificial Intelligence > Speech > Speech Recognition (0.91)
(2 more...)

Add feedback

Detecting agreement in multi-party dialogue: evaluating speaker diarisation versus a procedural baseline to enhance user engagement

Addlesee, Angus, Denley, Daniel, Edmondson, Andy, Gunson, Nancie, Garcia, Daniel Hernández, Kha, Alexandre, Lemon, Oliver, Ndubuisi, James, O'Reilly, Neil, Perochaud, Lia, Valeri, Raphaël, Worika, Miebaka

arXiv.org Artificial IntelligenceNov-6-2023

Conversational agents participating in multi-party interactions face significant challenges in dialogue state tracking, since the identity of the speaker adds significant contextual meaning. It is common to utilise diarisation models to identify the speaker. However, it is not clear if these are accurate enough to correctly identify specific conversational events such as agreement or disagreement during a real-time interaction. This study uses a cooperative quiz, where the conversational agent acts as quiz-show host, to determine whether diarisation or a frequency-and-proximity-based method is more accurate at determining agreement, and whether this translates to feelings of engagement from the players. Experimental results show that our procedural system was more engaging to players, and was more accurate at detecting agreement, reaching an average accuracy of 0.44 compared to 0.28 for the diarised system.

agreement, participant, proceedings, (16 more...)

arXiv.org Artificial Intelligence

2311.03021

Country:

North America > United States > Minnesota > Hennepin County > Minneapolis (0.14)
Oceania > Australia > Australian Indian Ocean Territories > Christmas Island (0.05)
North America > Antigua and Barbuda (0.05)
(9 more...)

Genre: Research Report (0.84)

Technology:

Information Technology > Human Computer Interaction (1.00)
Information Technology > Artificial Intelligence > Robots (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
(3 more...)

Add feedback

In search of strong embedding extractors for speaker diarisation

Jung, Jee-weon, Heo, Hee-Soo, Lee, Bong-Jin, Huh, Jaesung, Brown, Andrew, Kwon, Youngki, Watanabe, Shinji, Chung, Joon Son

arXiv.org Artificial IntelligenceOct-26-2022

Speaker embedding extractors (EEs), which map input audio to a speaker discriminant latent space, are of paramount importance in speaker diarisation. However, there are several challenges when adopting EEs for diarisation, from which we tackle two key problems. First, the evaluation is not straightforward because the features required for better performance differ between speaker verification and diarisation. We show that better performance on widely adopted speaker verification evaluation protocols does not lead to better diarisation performance. Second, embedding extractors have not seen utterances in which multiple speakers exist. These inputs are inevitably present in speaker diarisation because of overlapped speech and speaker changes; they degrade the performance. To mitigate the first problem, we generate speaker verification evaluation protocols that mimic the diarisation scenario better. We propose two data augmentation techniques to alleviate the second problem, making embedding extractors aware of overlapped speech or speaker change input. One technique generates overlapped speech segments, and the other generates segments where two speakers utter sequentially. Extensive experimental results using three state-of-the-art speaker embedding extractors demonstrate that both proposed approaches are effective.

artificial intelligence, diarisation, machine learning, (18 more...)

arXiv.org Artificial Intelligence

2210.14682

Country:

Europe > United Kingdom > England > Oxfordshire > Oxford (0.14)
Asia > South Korea (0.04)
North America > United States > Pennsylvania > Allegheny County > Pittsburgh (0.04)

Genre: Research Report (1.00)

Technology:

Information Technology > Artificial Intelligence > Speech > Speech Recognition (1.00)
Information Technology > Artificial Intelligence > Speech > Acoustic Processing (0.77)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.69)

Add feedback

Multi-scale speaker embedding-based graph attention networks for speaker diarisation

Kwon, Youngki, Heo, Hee-Soo, Jung, Jee-weon, Kim, You Jin, Lee, Bong-Jin, Chung, Joon Son

arXiv.org Artificial IntelligenceOct-7-2021

The objective of this work is effective speaker diarisation using multi-scale speaker embeddings. Typically, there is a trade-off between the ability to recognise short speaker segments and the discriminative power of the embedding, according to the segment length used for embedding extraction. To this end, recent works have proposed the use of multi-scale embeddings where segments with varying lengths are used. However, the scores are combined using a weighted summation scheme where the weights are fixed after the training phase, whereas the importance of segment lengths can differ with in a single session. To address this issue, we present three key contributions in this paper: (1) we propose graph attention networks for multi-scale speaker diarisation; (2) we design scale indicators to utilise scale information of each embedding; (3) we adapt the attention-based aggregation to utilise a pre-computed affinity matrix from multi-scale embeddings. We demonstrate the effectiveness of our method in various datasets where the speaker confusion which constitutes the primary metric drops over 10% in average relative compared to the baseline.

affinity matrix, attention network, baseline, (14 more...)

arXiv.org Artificial Intelligence

2110.03361

Country: Asia > South Korea (0.04)

Genre: Research Report (0.50)

Technology:

Information Technology > Artificial Intelligence > Speech (0.70)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (0.46)

Add feedback

Joint speaker diarisation and tracking in switching state-space model

Wong, Jeremy H. M., Gong, Yifan

arXiv.org Artificial IntelligenceSep-23-2021

Speakers may move around while diarisation is being performed. When a microphone array is used, the instantaneous locations of where the sounds originated from can be estimated, and previous investigations have shown that such information can be complementary to speaker embeddings in the diarisation task. However, these approaches often assume that speakers are fairly stationary throughout a meeting. This paper relaxes this assumption, by proposing to explicitly track the movements of speakers while jointly performing diarisation within a unified model. A state-space model is proposed, where the hidden state expresses the identity of the current active speaker and the predicted locations of all speakers. The model is implemented as a particle filter. Experiments on a Microsoft rich meeting transcription task show that the proposed joint location tracking and diarisation approach is able to perform comparably with other methods that use location information.

information, likelihood, particle, (17 more...)

arXiv.org Artificial Intelligence

2109.1114

Country:

Asia > Japan > Honshū > Kantō > Tokyo Metropolis Prefecture > Tokyo (0.14)
Europe > Italy > Tuscany > Florence (0.04)
South America > Chile > Santiago Metropolitan Region > Santiago Province > Santiago (0.04)
(12 more...)

Genre: Research Report (0.82)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (1.00)

Add feedback