speaker information
Iterative refinement, not training objective, makes HuBERT behave differently from wav2vec 2.0
Self-supervised models for speech representation learning now see widespread use for their versatility and performance on downstream tasks, but the effect of model architecture on the linguistic information learned in their representations remains under-studied. This study investigates two such models, HuBERT and wav2vec 2.0, and minimally compares two of their architectural differences: training objective and iterative pseudo-label refinement through multiple training iterations. We find that differences in canonical correlation of hidden representations to word identity, phoneme identity, and speaker identity are explained by training iteration, not training objective. We suggest that future work investigate the reason for the effectiveness of iterative refinement in encoding linguistic information in self-supervised speech representations.
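The canonical-correlation probe described above can be sketched numerically. The toy below is our own minimal construction, not the paper's code (`cca_score` and the synthetic data are illustrative): it whitens a feature matrix and a one-hot label matrix, then reads the canonical correlations off the singular values of the product of their orthonormal bases.

```python
import numpy as np

def cca_score(X, Y, eps=1e-8):
    """Mean canonical correlation between feature matrix X (n, dx) and
    one-hot label matrix Y (n, dy): whiten both via SVD, then take the
    singular values of the cross-product of their orthonormal bases."""
    X = X - X.mean(0)
    Y = Y - Y.mean(0)
    Ux, Sx, _ = np.linalg.svd(X, full_matrices=False)
    Uy, Sy, _ = np.linalg.svd(Y, full_matrices=False)
    Ux, Uy = Ux[:, Sx > eps], Uy[:, Sy > eps]   # drop null directions
    rho = np.linalg.svd(Ux.T @ Uy, compute_uv=False)
    return float(np.clip(rho, 0.0, 1.0).mean())

rng = np.random.default_rng(0)
labels = rng.integers(0, 5, size=400)            # e.g. phoneme IDs per frame
Y = np.eye(5)[labels]                            # one-hot targets
# a representation that encodes the label, plus small noise
X_informative = Y @ rng.normal(size=(5, 32)) + 0.1 * rng.normal(size=(400, 32))
X_random = rng.normal(size=(400, 32))            # representation with no label info
print(cca_score(X_informative, Y) > cca_score(X_random, Y))  # True
```

Comparing such scores layer by layer across checkpoints is one way to attribute representational differences to training iteration rather than objective.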
Identifying Speaker Information in Feed-Forward Layers of Self-Supervised Speech Transformers
Lin, Tzu-Quan, Cheng, Hsi-Chun, Lee, Hung-yi, Tang, Hao
In recent years, the impact of self-supervised speech Transformers has extended to speaker-related applications. However, little research has explored how these models encode speaker information. In this work, we address this gap by identifying neurons in the feed-forward layers that are correlated with speaker information. Specifically, we analyze neurons associated with k-means clusters of self-supervised features and i-vectors. Our analysis reveals that these clusters correspond to broad phonetic and gender classes, making them suitable for identifying neurons that represent speakers. By protecting these neurons during pruning, we can significantly preserve performance on speaker-related tasks, demonstrating their crucial role in encoding speaker information.
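The general recipe of scoring individual feed-forward neurons against group labels and shielding the top scorers from pruning can be sketched as follows. This is our own hypothetical illustration, not the paper's method or code: `speaker_neuron_mask`, the eta-squared score, and the synthetic activations are all ours.

```python
import numpy as np

def speaker_neuron_mask(acts, labels, keep=10):
    """Score each feed-forward neuron by how well its activation separates
    label groups (eta-squared: between-group share of total variance),
    then mark the top `keep` neurons as protected during pruning."""
    grand = acts.mean(0)
    total = ((acts - grand) ** 2).sum(0)
    between = np.zeros_like(grand)
    for c in np.unique(labels):
        grp = acts[labels == c]
        between += len(grp) * (grp.mean(0) - grand) ** 2
    eta2 = between / np.maximum(total, 1e-12)
    protected = np.zeros(acts.shape[1], dtype=bool)
    protected[np.argsort(eta2)[-keep:]] = True
    return protected

rng = np.random.default_rng(1)
labels = rng.integers(0, 4, size=600)      # e.g. speaker / cluster IDs per frame
acts = rng.normal(size=(600, 64))          # fake FFN activations
acts[:, :5] += labels[:, None] * 2.0       # first 5 neurons track the label
mask = speaker_neuron_mask(acts, labels, keep=5)
print(mask[:5].all())  # True: the label-tracking neurons are protected
```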
Analyzing the relationships between pretraining language, phonetic, tonal, and speaker information in self-supervised speech models
Gubian, Michele, Krehan, Ioana, Liu, Oli, Kirby, James, Goldwater, Sharon
Analyses of self-supervised speech models have begun to reveal where and how they represent different types of information. However, almost all analyses have focused on English. Here, we examine how wav2vec2 models trained on four different languages encode both language-matched and non-matched speech. We use probing classifiers and geometric analyses to examine how phones, lexical tones, and speaker information are represented. We show that for all pretraining and test languages, the subspaces encoding phones, tones, and speakers are largely orthogonal, and that layerwise patterns of probing accuracy are similar, with a relatively small advantage for matched-language phone and tone (but not speaker) probes in the later layers. Our findings suggest that the structure of representations learned by wav2vec2 is largely independent of the speech material used during pretraining.
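One common way to quantify the orthogonality of such subspaces is via principal angles between the spans of the probe directions. The sketch below is our own illustration under that assumption (the geometric analysis in the paper may differ); cosines near 0 mean near-orthogonal subspaces.

```python
import numpy as np

def subspace_angle_cos(A, B):
    """Cosines of the principal angles between the column spaces of A and B:
    orthonormalize each basis, then take singular values of their product.
    Values near 0 indicate near-orthogonal subspaces; near 1, overlap."""
    Qa, _ = np.linalg.qr(A)
    Qb, _ = np.linalg.qr(B)
    return np.linalg.svd(Qa.T @ Qb, compute_uv=False)

# toy "probe directions" in a 10-dim representation space
phone_dirs = np.eye(10)[:, :3]    # spanned by axes 0-2
tone_dirs = np.eye(10)[:, 3:5]    # spanned by axes 3-4 -> orthogonal to phones
mixed_dirs = np.eye(10)[:, 1:3]   # contained in the phone subspace
print(subspace_angle_cos(phone_dirs, tone_dirs).max())   # ~0.0
print(subspace_angle_cos(phone_dirs, mixed_dirs).max())  # ~1.0
```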
Towards Robust Overlapping Speech Detection: A Speaker-Aware Progressive Approach Using WavLM
Sun, Zhaokai, Zhang, Li, Wang, Qing, Zhou, Pan, Xie, Lei
Overlapping Speech Detection (OSD) aims to identify regions where multiple speakers overlap in a conversation, a critical challenge in multi-party speech processing. This work proposes a speaker-aware progressive OSD model that leverages a progressive training strategy to enhance the correlation between subtasks such as voice activity detection (VAD) and overlap detection. To improve acoustic representation, we explore the effectiveness of state-of-the-art self-supervised learning (SSL) models, including WavLM and wav2vec 2.0, while incorporating a speaker attention module to enrich features with frame-level speaker information. Experimental results show that the proposed method achieves state-of-the-art performance, with an F1 score of 82.76% on the AMI test set, demonstrating its robustness and effectiveness in OSD.
TED: Turn Emphasis with Dialogue Feature Attention for Emotion Recognition in Conversation
Emotion recognition in conversation (ERC) has been attracting attention through methods that model multi-turn contexts. Feeding multi-turn input to a pretrained model implicitly assumes that the current turn and the other turns are distinguished during training by inserting special tokens into the input sequence. This paper proposes a priority-based attention method, called Turn Emphasis with Dialogue (TED), that distinguishes each turn explicitly by adding dialogue features to the attention mechanism. TED assigns each turn a priority based on turn position and speaker information, treated as dialogue features. It applies multi-head self-attention over turn-based vectors of the multi-turn input and adjusts the attention scores with these dialogue features. We evaluate TED on four typical benchmarks. The experimental results demonstrate that TED achieves high overall performance on all datasets and state-of-the-art performance on IEMOCAP, which contains numerous turns.
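The core mechanism — shifting attention logits by a per-turn priority before the softmax — can be sketched in a few lines. This is a minimal single-head sketch under our own assumptions, not the paper's implementation: `ted_attention` and the bias values are illustrative, and the real method derives priorities from turn position and speaker features.

```python
import numpy as np

def ted_attention(turn_vecs, priority):
    """Single-head self-attention over turn vectors whose logits are
    shifted by a per-turn additive priority before the softmax."""
    d = turn_vecs.shape[-1]
    logits = turn_vecs @ turn_vecs.T / np.sqrt(d)   # plain dot-product scores
    logits = logits + priority[None, :]             # emphasize high-priority turns
    w = np.exp(logits - logits.max(-1, keepdims=True))
    w = w / w.sum(-1, keepdims=True)                # row-wise softmax
    return w @ turn_vecs, w

rng = np.random.default_rng(2)
turns = rng.normal(size=(6, 16))                    # 6 turns in the dialogue
no_bias = np.zeros(6)
emphasis = np.array([0., 0., 0., 0., 0., 8.])       # emphasize the current (last) turn
_, w0 = ted_attention(turns, no_bias)
_, w1 = ted_attention(turns, emphasis)
print(w1[:, -1].mean() > w0[:, -1].mean())  # True: the biased turn draws more attention
```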
CrossSpeech++: Cross-lingual Speech Synthesis with Decoupled Language and Speaker Generation
Kim, Ji-Hoon, Yang, Hong-Sun, Ju, Yoon-Cheol, Kim, Il-Hwan, Kim, Byeong-Yeol, Chung, Joon Son
The goal of this work is to generate natural speech in multiple languages while maintaining the same speaker identity, a task known as cross-lingual speech synthesis. A key challenge of cross-lingual speech synthesis is the language-speaker entanglement problem, which causes the quality of cross-lingual systems to lag behind that of intra-lingual systems. In this paper, we propose CrossSpeech++, which effectively disentangles language and speaker information and significantly improves the quality of cross-lingual speech synthesis. To this end, we break the complex speech generation pipeline into two simple components: language-dependent and speaker-dependent generators. The language-dependent generator produces linguistic variations that are not biased by specific speaker attributes. The speaker-dependent generator models acoustic variations that characterize speaker identity. By handling each type of information in separate modules, our method can effectively disentangle language and speaker representation. We conduct extensive experiments using various metrics, and demonstrate that CrossSpeech++ achieves significant improvements in cross-lingual speech synthesis, outperforming existing methods by a large margin.
Friends-MMC: A Dataset for Multi-modal Multi-party Conversation Understanding
Wang, Yueqian, Meng, Xiaojun, Wang, Yuxuan, Liang, Jianxin, Liu, Qun, Zhao, Dongyan
Multi-modal multi-party conversation (MMC) is a less-studied yet important research topic because it fits real-world scenarios well and thus has potentially wider applications. Compared with traditional multi-modal conversations, MMC requires stronger character-centered understanding abilities, as many interlocutors appear in both the visual and textual context. To facilitate the study of this problem, we present Friends-MMC, an MMC dataset that contains 24,000+ unique utterances paired with video context. To support character-centered understanding of the dialogue, we also annotate the speaker of each utterance and the names and bounding boxes of faces that appear in the video. Based on the Friends-MMC dataset, we further study two fundamental MMC tasks: conversation speaker identification and conversation response prediction, both of which have a multi-party nature with video or images as visual context. For conversation speaker identification, we demonstrate the inefficiencies of existing methods such as pre-trained models, and propose a simple yet effective baseline that leverages an optimization solver to combine the context of the two modalities for better performance. For conversation response prediction, we fine-tune generative dialogue models on Friends-MMC and analyze the benefits of speaker information. The code and dataset are publicly available at https://github.com/yellow-binary-tree/Friends-MMC; we call for more attention to modeling speaker information when understanding conversations.
Enhancing Talk Moves Analysis in Mathematics Tutoring through Classroom Teaching Discourse
Cao, Jie, Suresh, Abhijit, Jacobs, Jennifer, Clevenger, Charis, Howard, Amanda, Brown, Chelsea, Milne, Brent, Fischaber, Tom, Sumner, Tamara, Martin, James H.
Human tutoring interventions play a crucial role in supporting student learning, improving academic performance, and promoting personal growth. This paper focuses on analyzing mathematics tutoring discourse using talk moves - a framework of dialogue acts grounded in Accountable Talk theory. However, scaling the collection, annotation, and analysis of extensive tutoring dialogues to develop machine learning models is a challenging and resource-intensive task. To address this, we present SAGA22, a compact dataset, and explore various modeling strategies, including dialogue context, speaker information, pretraining datasets, and further fine-tuning. By leveraging existing datasets and models designed for classroom teaching, our results demonstrate that supplementary pretraining on classroom data enhances model performance in tutoring settings, particularly when incorporating longer context and speaker information. Additionally, we conduct extensive ablation studies to underscore the challenges in talk move modeling.
SKQVC: One-Shot Voice Conversion by K-Means Quantization with Self-Supervised Speech Representations
Sim, Youngjun, Yoon, Jinsung, Suh, Young-Joo
One-shot voice conversion converts the speaker identity of a source into that of an arbitrary target using only a single utterance. This process typically employs disentanglement-based methods to separate content and speaker information, replacing the source speaker's information with that of the target speaker. The key challenge lies in effectively disentangling content and speaker information while preserving both. To address this, various strategies have been proposed, including information bottlenecks [1, 2], additional loss functions [3, 4], normalization techniques [5, 6], and vector quantization (VQ) methods [7-9]. VQ methods capture content information by replacing the input embedding with the nearest vectors from a discrete codebook, which primarily represents phonetic features within the continuous content space, thus removing speaker information.

FreeVC [12] captures content information using SSL features, combined with data perturbation, a bottleneck network, and a conditional normalizing flow method, while employing an external speaker embedding to achieve high naturalness and similarity in voice conversion.

SSL features, which are speech representations derived from self-supervised learning (SSL) models such as HuBERT [14] and WavLM [15], have demonstrated the ability to linearly predict various speech attributes [16]. These features are encoded such that instances of the same phone are closer together than different phones, meaning that nearby features share similar phonetic content [17, 18]. Due to this inherent characteristic, SSL features have been increasingly used in ...
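The VQ step — replacing each input embedding with its nearest codebook vector to keep phonetic content while discarding fine-grained speaker detail — can be sketched as follows. This is a generic nearest-neighbor quantization sketch under our own assumptions, not SKQVC's implementation; `vq_content` and the toy data are illustrative.

```python
import numpy as np

def vq_content(features, codebook):
    """Replace each frame-level feature with its nearest codebook vector
    (squared Euclidean distance), returning the quantized features and
    the chosen codebook indices."""
    d2 = ((features[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
    idx = d2.argmin(1)
    return codebook[idx], idx

rng = np.random.default_rng(3)
codebook = rng.normal(size=(8, 4))                   # e.g. k-means centroids of SSL features
# frames near centroids 0, 0, 3, 5, perturbed by small "speaker" noise
frames = codebook[[0, 0, 3, 5]] + 0.01 * rng.normal(size=(4, 4))
quantized, idx = vq_content(frames, codebook)
print(idx.tolist())  # [0, 0, 3, 5]: the perturbation is snapped away
```

Because every frame is snapped to a shared centroid, residual speaker variation around each phonetic cluster is removed by construction.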