speaker information
Iterative refinement, not training objective, makes HuBERT behave differently from wav2vec 2.0
Self-supervised models for speech representation learning now see widespread use for their versatility and performance on downstream tasks, but the effect of model architecture on the linguistic information learned in their representations remains under-studied. This study investigates two such models, HuBERT and wav2vec 2.0, and minimally compares two of their architectural differences: training objective and iterative pseudo-label refinement through multiple training iterations. We find that differences in canonical correlation of hidden representations to word identity, phoneme identity, and speaker identity are explained by training iteration, not training objective. We suggest that future work investigate the reason for the effectiveness of iterative refinement in encoding linguistic information in self-supervised speech representations.
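The canonical-correlation probe described above can be sketched numerically. The toy below is our own minimal construction, not the paper's code (`cca_score` and the synthetic data are illustrative): it whitens a feature matrix and a one-hot label matrix, then reads the canonical correlations off the singular values of the product of their orthonormal bases.

```python
import numpy as np

def cca_score(X, Y, eps=1e-8):
    """Mean canonical correlation between feature matrix X (n, dx) and
    one-hot label matrix Y (n, dy): whiten both via SVD, then take the
    singular values of the cross-product of their orthonormal bases."""
    X = X - X.mean(0)
    Y = Y - Y.mean(0)
    Ux, Sx, _ = np.linalg.svd(X, full_matrices=False)
    Uy, Sy, _ = np.linalg.svd(Y, full_matrices=False)
    Ux, Uy = Ux[:, Sx > eps], Uy[:, Sy > eps]   # drop null directions
    rho = np.linalg.svd(Ux.T @ Uy, compute_uv=False)
    return float(np.clip(rho, 0.0, 1.0).mean())

rng = np.random.default_rng(0)
labels = rng.integers(0, 5, size=400)            # e.g. phoneme IDs per frame
Y = np.eye(5)[labels]                            # one-hot targets
# a representation that encodes the label, plus small noise
X_informative = Y @ rng.normal(size=(5, 32)) + 0.1 * rng.normal(size=(400, 32))
X_random = rng.normal(size=(400, 32))            # representation with no label info
print(cca_score(X_informative, Y) > cca_score(X_random, Y))  # True
```

Comparing such scores layer by layer across checkpoints is one way to attribute representational differences to training iteration rather than objective.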
Identifying Speaker Information in Feed-Forward Layers of Self-Supervised Speech Transformers
Lin, Tzu-Quan, Cheng, Hsi-Chun, Lee, Hung-yi, Tang, Hao
In recent years, the impact of self-supervised speech Transformers has extended to speaker-related applications. However, little research has explored how these models encode speaker information. In this work, we address this gap by identifying neurons in the feed-forward layers that are correlated with speaker information. Specifically, we analyze neurons associated with k-means clusters of self-supervised features and i-vectors. Our analysis reveals that these clusters correspond to broad phonetic and gender classes, making them suitable for identifying neurons that represent speakers. By protecting these neurons during pruning, we can significantly preserve performance on speaker-related tasks, demonstrating their crucial role in encoding speaker information.
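The general recipe of scoring individual feed-forward neurons against group labels and shielding the top scorers from pruning can be sketched as follows. This is our own hypothetical illustration, not the paper's method or code: `speaker_neuron_mask`, the eta-squared score, and the synthetic activations are all ours.

```python
import numpy as np

def speaker_neuron_mask(acts, labels, keep=10):
    """Score each feed-forward neuron by how well its activation separates
    label groups (eta-squared: between-group share of total variance),
    then mark the top `keep` neurons as protected during pruning."""
    grand = acts.mean(0)
    total = ((acts - grand) ** 2).sum(0)
    between = np.zeros_like(grand)
    for c in np.unique(labels):
        grp = acts[labels == c]
        between += len(grp) * (grp.mean(0) - grand) ** 2
    eta2 = between / np.maximum(total, 1e-12)
    protected = np.zeros(acts.shape[1], dtype=bool)
    protected[np.argsort(eta2)[-keep:]] = True
    return protected

rng = np.random.default_rng(1)
labels = rng.integers(0, 4, size=600)      # e.g. speaker / cluster IDs per frame
acts = rng.normal(size=(600, 64))          # fake FFN activations
acts[:, :5] += labels[:, None] * 2.0       # first 5 neurons track the label
mask = speaker_neuron_mask(acts, labels, keep=5)
print(mask[:5].all())  # True: the label-tracking neurons are protected
```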
Analyzing the relationships between pretraining language, phonetic, tonal, and speaker information in self-supervised speech models
Gubian, Michele, Krehan, Ioana, Liu, Oli, Kirby, James, Goldwater, Sharon
Analyses of self-supervised speech models have begun to reveal where and how they represent different types of information. However, almost all analyses have focused on English. Here, we examine how wav2vec2 models trained on four different languages encode both language-matched and non-matched speech. We use probing classifiers and geometric analyses to examine how phones, lexical tones, and speaker information are represented. We show that for all pretraining and test languages, the subspaces encoding phones, tones, and speakers are largely orthogonal, and that layerwise patterns of probing accuracy are similar, with a relatively small advantage for matched-language phone and tone (but not speaker) probes in the later layers. Our findings suggest that the structure of representations learned by wav2vec2 is largely independent of the speech material used during pretraining.
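One common way to quantify the orthogonality of such subspaces is via principal angles between the spans of the probe directions. The sketch below is our own illustration under that assumption (the geometric analysis in the paper may differ); cosines near 0 mean near-orthogonal subspaces.

```python
import numpy as np

def subspace_angle_cos(A, B):
    """Cosines of the principal angles between the column spaces of A and B:
    orthonormalize each basis, then take singular values of their product.
    Values near 0 indicate near-orthogonal subspaces; near 1, overlap."""
    Qa, _ = np.linalg.qr(A)
    Qb, _ = np.linalg.qr(B)
    return np.linalg.svd(Qa.T @ Qb, compute_uv=False)

# toy "probe directions" in a 10-dim representation space
phone_dirs = np.eye(10)[:, :3]    # spanned by axes 0-2
tone_dirs = np.eye(10)[:, 3:5]    # spanned by axes 3-4 -> orthogonal to phones
mixed_dirs = np.eye(10)[:, 1:3]   # contained in the phone subspace
print(subspace_angle_cos(phone_dirs, tone_dirs).max())   # ~0.0
print(subspace_angle_cos(phone_dirs, mixed_dirs).max())  # ~1.0
```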
Towards Robust Overlapping Speech Detection: A Speaker-Aware Progressive Approach Using WavLM
Sun, Zhaokai, Zhang, Li, Wang, Qing, Zhou, Pan, Xie, Lei
Overlapping Speech Detection (OSD) aims to identify regions where multiple speakers overlap in a conversation, a critical challenge in multi-party speech processing. This work proposes a speaker-aware progressive OSD model that leverages a progressive training strategy to enhance the correlation between subtasks such as voice activity detection (VAD) and overlap detection. To improve acoustic representation, we explore the effectiveness of state-of-the-art self-supervised learning (SSL) models, including WavLM and wav2vec 2.0, while incorporating a speaker attention module to enrich features with frame-level speaker information. Experimental results show that the proposed method achieves state-of-the-art performance, with an F1 score of 82.76% on the AMI test set, demonstrating its robustness and effectiveness in OSD.
TED: Turn Emphasis with Dialogue Feature Attention for Emotion Recognition in Conversation
Emotion recognition in conversation (ERC) has been attracting attention through methods that model multi-turn contexts. Feeding multi-turn input to a pretrained model implicitly assumes that the current turn and the other turns are distinguished during training by inserting special tokens into the input sequence. This paper proposes a priority-based attention method, called Turn Emphasis with Dialogue (TED), that distinguishes each turn explicitly by adding dialogue features to the attention mechanism. TED assigns each turn a priority based on turn position and speaker information, treated as dialogue features. It applies multi-head self-attention over turn-based vectors of the multi-turn input and adjusts the attention scores with these dialogue features. We evaluate TED on four typical benchmarks. The experimental results demonstrate that TED achieves high overall performance on all datasets and state-of-the-art performance on IEMOCAP, which contains numerous turns.
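The core mechanism — shifting attention logits by a per-turn priority before the softmax — can be sketched in a few lines. This is a minimal single-head sketch under our own assumptions, not the paper's implementation: `ted_attention` and the bias values are illustrative, and the real method derives priorities from turn position and speaker features.

```python
import numpy as np

def ted_attention(turn_vecs, priority):
    """Single-head self-attention over turn vectors whose logits are
    shifted by a per-turn additive priority before the softmax."""
    d = turn_vecs.shape[-1]
    logits = turn_vecs @ turn_vecs.T / np.sqrt(d)   # plain dot-product scores
    logits = logits + priority[None, :]             # emphasize high-priority turns
    w = np.exp(logits - logits.max(-1, keepdims=True))
    w = w / w.sum(-1, keepdims=True)                # row-wise softmax
    return w @ turn_vecs, w

rng = np.random.default_rng(2)
turns = rng.normal(size=(6, 16))                    # 6 turns in the dialogue
no_bias = np.zeros(6)
emphasis = np.array([0., 0., 0., 0., 0., 8.])       # emphasize the current (last) turn
_, w0 = ted_attention(turns, no_bias)
_, w1 = ted_attention(turns, emphasis)
print(w1[:, -1].mean() > w0[:, -1].mean())  # True: the biased turn draws more attention
```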
CrossSpeech++: Cross-lingual Speech Synthesis with Decoupled Language and Speaker Generation
Kim, Ji-Hoon, Yang, Hong-Sun, Ju, Yoon-Cheol, Kim, Il-Hwan, Kim, Byeong-Yeol, Chung, Joon Son
The goal of this work is to generate natural speech in multiple languages while maintaining the same speaker identity, a task known as cross-lingual speech synthesis. A key challenge of cross-lingual speech synthesis is the language-speaker entanglement problem, which causes the quality of cross-lingual systems to lag behind that of intra-lingual systems. In this paper, we propose CrossSpeech++, which effectively disentangles language and speaker information and significantly improves the quality of cross-lingual speech synthesis. To this end, we break the complex speech generation pipeline into two simple components: language-dependent and speaker-dependent generators. The language-dependent generator produces linguistic variations that are not biased by specific speaker attributes. The speaker-dependent generator models acoustic variations that characterize speaker identity. By handling each type of information in separate modules, our method can effectively disentangle language and speaker representation. We conduct extensive experiments using various metrics, and demonstrate that CrossSpeech++ achieves significant improvements in cross-lingual speech synthesis, outperforming existing methods by a large margin.
Friends-MMC: A Dataset for Multi-modal Multi-party Conversation Understanding
Wang, Yueqian, Meng, Xiaojun, Wang, Yuxuan, Liang, Jianxin, Liu, Qun, Zhao, Dongyan
Multi-modal multi-party conversation (MMC) is a less-studied yet important research topic because it fits real-world scenarios well and thus has potentially wider applications. Compared with traditional multi-modal conversations, MMC requires stronger character-centered understanding abilities, as many interlocutors appear in both the visual and textual context. To facilitate the study of this problem, we present Friends-MMC, an MMC dataset that contains 24,000+ unique utterances paired with video context. To support character-centered understanding of the dialogue, we also annotate the speaker of each utterance and the names and bounding boxes of faces that appear in the video. Based on the Friends-MMC dataset, we further study two fundamental MMC tasks: conversation speaker identification and conversation response prediction, both of which have a multi-party nature with video or images as visual context. For conversation speaker identification, we demonstrate the inefficiencies of existing methods such as pre-trained models, and propose a simple yet effective baseline that leverages an optimization solver to combine the context of the two modalities for better performance. For conversation response prediction, we fine-tune generative dialogue models on Friends-MMC and analyze the benefits of speaker information. The code and dataset are publicly available at https://github.com/yellow-binary-tree/Friends-MMC; we call for more attention to modeling speaker information when understanding conversations.
Enhancing Talk Moves Analysis in Mathematics Tutoring through Classroom Teaching Discourse
Cao, Jie, Suresh, Abhijit, Jacobs, Jennifer, Clevenger, Charis, Howard, Amanda, Brown, Chelsea, Milne, Brent, Fischaber, Tom, Sumner, Tamara, Martin, James H.
Human tutoring interventions play a crucial role in supporting student learning, improving academic performance, and promoting personal growth. This paper focuses on analyzing mathematics tutoring discourse using talk moves - a framework of dialogue acts grounded in Accountable Talk theory. However, scaling the collection, annotation, and analysis of extensive tutoring dialogues to develop machine learning models is a challenging and resource-intensive task. To address this, we present SAGA22, a compact dataset, and explore various modeling strategies, including dialogue context, speaker information, pretraining datasets, and further fine-tuning. By leveraging existing datasets and models designed for classroom teaching, our results demonstrate that supplementary pretraining on classroom data enhances model performance in tutoring settings, particularly when incorporating longer context and speaker information. Additionally, we conduct extensive ablation studies to underscore the challenges in talk move modeling.
SKQVC: One-Shot Voice Conversion by K-Means Quantization with Self-Supervised Speech Representations
Sim, Youngjun, Yoon, Jinsung, Suh, Young-Joo
One-shot voice conversion converts the speaker identity of a source into that of an arbitrary target using only a single utterance. This process typically employs disentanglement-based methods to separate content and speaker information, replacing the source speaker's information with that of the target speaker. The key challenge lies in effectively disentangling content and speaker information while preserving both. To address this, various strategies have been proposed, including information bottlenecks [1, 2], additional loss functions [3, 4], normalization techniques [5, 6], and vector quantization (VQ) methods [7-9]. VQ methods capture content information by replacing the input embedding with the nearest vectors from a discrete codebook, which primarily represents phonetic features within the continuous content space, thus removing speaker information.

FreeVC [12] captures content information using SSL features, combined with data perturbation, a bottleneck network, and a conditional normalizing flow method, while employing an external speaker embedding to achieve high naturalness and similarity in voice conversion.

SSL features, which are speech representations derived from self-supervised learning (SSL) models such as HuBERT [14] and WavLM [15], have demonstrated the ability to linearly predict various speech attributes [16]. These features are encoded such that instances of the same phone are closer together than different phones, meaning that nearby features share similar phonetic content [17, 18]. Due to this inherent characteristic, SSL features have been increasingly used in ...
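The VQ step — replacing each input embedding with its nearest codebook vector to keep phonetic content while discarding fine-grained speaker detail — can be sketched as follows. This is a generic nearest-neighbor quantization sketch under our own assumptions, not SKQVC's implementation; `vq_content` and the toy data are illustrative.

```python
import numpy as np

def vq_content(features, codebook):
    """Replace each frame-level feature with its nearest codebook vector
    (squared Euclidean distance), returning the quantized features and
    the chosen codebook indices."""
    d2 = ((features[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
    idx = d2.argmin(1)
    return codebook[idx], idx

rng = np.random.default_rng(3)
codebook = rng.normal(size=(8, 4))                   # e.g. k-means centroids of SSL features
# frames near centroids 0, 0, 3, 5, perturbed by small "speaker" noise
frames = codebook[[0, 0, 3, 5]] + 0.01 * rng.normal(size=(4, 4))
quantized, idx = vq_content(frames, codebook)
print(idx.tolist())  # [0, 0, 3, 5]: the perturbation is snapped away
```

Because every frame is snapped to a shared centroid, residual speaker variation around each phonetic cluster is removed by construction.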