Plug-and-Play Emotion Graphs for Compositional Prompting in Zero-Shot Speech Emotion Recognition
Jiacheng Shi, Hongfei Du, Y. Alicia Hong, Ye Gao
–arXiv.org Artificial Intelligence
Large audio-language models (LALMs) exhibit strong zero-shot performance across speech tasks but struggle with speech emotion recognition (SER) due to weak paralinguistic modeling and limited cross-modal reasoning. We propose Compositional Chain-of-Thought Prompting for Emotion Reasoning (CCoT-Emo), a framework that introduces structured Emotion Graphs (EGs) to guide LALMs in emotion inference without fine-tuning. Each EG encodes seven acoustic features (e.g., pitch, speech rate, jitter, shimmer), textual sentiment, keywords, and cross-modal associations. Embedded into prompts, EGs provide interpretable and compositional representations that enhance LALM reasoning. Experiments across SER benchmarks show that CCoT-Emo outperforms prior SOTA and improves accuracy over zero-shot baselines.
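As a rough illustration of the idea, the sketch below assembles a minimal Emotion Graph as a plain dictionary and serializes it into a chain-of-thought prompt. All field names, the assumed set of seven acoustic features, and the prompt template are illustrative assumptions, not the authors' actual implementation.

```python
# Hypothetical sketch of a CCoT-Emo-style Emotion Graph (EG) prompt.
# Feature names and the template are assumptions for illustration only.

def build_emotion_graph(acoustic, sentiment, keywords):
    """Assemble an EG: seven acoustic features, textual sentiment, keywords."""
    expected = {"pitch", "speech_rate", "jitter", "shimmer",
                "energy", "hnr", "pause_ratio"}  # assumed feature set
    missing = expected - acoustic.keys()
    if missing:
        raise ValueError(f"missing acoustic features: {sorted(missing)}")
    return {"acoustic": acoustic, "sentiment": sentiment, "keywords": keywords}

def eg_to_prompt(eg, transcript):
    """Serialize the EG into a compositional reasoning prompt for an LALM."""
    acoustic_desc = ", ".join(f"{k}={v}" for k, v in sorted(eg["acoustic"].items()))
    return (
        "Acoustic cues: " + acoustic_desc + "\n"
        f"Text sentiment: {eg['sentiment']}; keywords: {', '.join(eg['keywords'])}\n"
        f'Transcript: "{transcript}"\n'
        "Reason step by step over the cues above, then name the speaker's emotion."
    )

if __name__ == "__main__":
    eg = build_emotion_graph(
        acoustic={"pitch": "high", "speech_rate": "fast", "jitter": "elevated",
                  "shimmer": "elevated", "energy": "high", "hnr": "low",
                  "pause_ratio": "low"},
        sentiment="negative",
        keywords=["can't believe", "again"],
    )
    print(eg_to_prompt(eg, "I can't believe this happened again."))
```

Because the EG is injected purely at the prompt level, such a structure would leave the underlying model frozen, consistent with the paper's no-fine-tuning setting.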
Oct-1-2025
- Genre:
- Research Report (0.82)
- Technology:
- Information Technology > Artificial Intelligence
- Cognitive Science > Emotion (0.63)
- Natural Language > Large Language Model (0.93)
- Speech > Speech Recognition (0.68)