Speaker Style-Aware Phoneme Anchoring for Improved Cross-Lingual Speech Emotion Recognition
Upadhyay, Shreya G., Busso, Carlos, Lee, Chi-Chun
arXiv.org Artificial Intelligence
Cross-lingual speech emotion recognition (SER) remains a challenging task due to differences in phonetic variability and speaker-specific expressive styles across languages. Effectively capturing emotion under such diverse conditions requires a framework that can align the externalization of emotions across different speakers and languages. To address this problem, we propose a speaker-style aware phoneme anchoring framework that aligns emotional expression at the phonetic and speaker levels. Our method builds emotion-specific speaker communities via graph-based clustering to capture shared speaker traits. Using these groups, we apply dual-space anchoring in speaker and phonetic spaces to enable better emotion transfer across languages. Evaluations on the MSP-Podcast (English) and BIIC-Podcast (Taiwanese Mandarin) corpora demonstrate improved generalization over competitive baselines and provide valuable insights into the commonalities in cross-lingual emotion representation.
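The abstract's "emotion-specific speaker communities via graph-based clustering" step can be illustrated with a minimal sketch. This is not the paper's implementation; it assumes speakers are represented by fixed-dimensional style embeddings and uses a simple stand-in for community detection: threshold the cosine-similarity graph and take its connected components.

```python
import numpy as np

def build_speaker_communities(embeddings, threshold=0.8):
    """Group speaker style embeddings into communities.

    Hypothetical illustration: edges connect speakers whose cosine
    similarity exceeds `threshold`; communities are the connected
    components of that graph (a simplified proxy for the paper's
    graph-based clustering).
    """
    X = np.asarray(embeddings, dtype=float)
    X = X / np.linalg.norm(X, axis=1, keepdims=True)  # unit-normalize
    sim = X @ X.T                                     # cosine similarity
    n = len(X)
    adj = [[j for j in range(n) if j != i and sim[i, j] >= threshold]
           for i in range(n)]
    seen, communities = set(), []
    for s in range(n):                # depth-first search per component
        if s in seen:
            continue
        stack, comp = [s], []
        while stack:
            u = stack.pop()
            if u in seen:
                continue
            seen.add(u)
            comp.append(u)
            stack.extend(adj[u])
        communities.append(sorted(comp))
    return communities

# Toy example: two well-separated speaker-style clusters
emb = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
print(build_speaker_communities(emb, threshold=0.8))  # → [[0, 1], [2, 3]]
```

In the paper's framework, such communities would then anchor emotion transfer jointly in the speaker and phonetic spaces; the clustering above only sketches the first, community-building stage.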
Sep-26-2025