Plug-and-Play Emotion Graphs for Compositional Prompting in Zero-Shot Speech Emotion Recognition
Jiacheng Shi, Hongfei Du, Y. Alicia Hong, Ye Gao
–arXiv.org Artificial Intelligence
Large audio-language models (LALMs) exhibit strong zero-shot performance across speech tasks but struggle with speech emotion recognition (SER) due to weak paralinguistic modeling and limited cross-modal reasoning. We propose Compositional Chain-of-Thought Prompting for Emotion Reasoning (CCoT-Emo), a framework that introduces structured Emotion Graphs (EGs) to guide LALMs in emotion inference without fine-tuning. Each EG encodes seven acoustic features (e.g., pitch, speech rate, jitter, shimmer), textual sentiment, keywords, and cross-modal associations. Embedded into prompts, EGs provide interpretable and compositional representations that enhance LALM reasoning. Experiments across SER benchmarks show that CCoT-Emo outperforms prior SOTA and improves accuracy over zero-shot baselines.
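As a rough illustration of the idea, the sketch below assembles a minimal Emotion Graph as a plain dictionary and serializes it into a chain-of-thought prompt. All field names, the assumed set of seven acoustic features, and the prompt template are illustrative assumptions, not the authors' actual implementation.

```python
# Hypothetical sketch of a CCoT-Emo-style Emotion Graph (EG) prompt.
# Feature names and the template are assumptions for illustration only.

def build_emotion_graph(acoustic, sentiment, keywords):
    """Assemble an EG: seven acoustic features, textual sentiment, keywords."""
    expected = {"pitch", "speech_rate", "jitter", "shimmer",
                "energy", "hnr", "pause_ratio"}  # assumed feature set
    missing = expected - acoustic.keys()
    if missing:
        raise ValueError(f"missing acoustic features: {sorted(missing)}")
    return {"acoustic": acoustic, "sentiment": sentiment, "keywords": keywords}

def eg_to_prompt(eg, transcript):
    """Serialize the EG into a compositional reasoning prompt for an LALM."""
    acoustic_desc = ", ".join(f"{k}={v}" for k, v in sorted(eg["acoustic"].items()))
    return (
        "Acoustic cues: " + acoustic_desc + "\n"
        f"Text sentiment: {eg['sentiment']}; keywords: {', '.join(eg['keywords'])}\n"
        f'Transcript: "{transcript}"\n'
        "Reason step by step over the cues above, then name the speaker's emotion."
    )

if __name__ == "__main__":
    eg = build_emotion_graph(
        acoustic={"pitch": "high", "speech_rate": "fast", "jitter": "elevated",
                  "shimmer": "elevated", "energy": "high", "hnr": "low",
                  "pause_ratio": "low"},
        sentiment="negative",
        keywords=["can't believe", "again"],
    )
    print(eg_to_prompt(eg, "I can't believe this happened again."))
```

Because the EG is injected purely at the prompt level, such a structure would leave the underlying model frozen, consistent with the paper's no-fine-tuning setting.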
Oct-1-2025
- Genre:
- Research Report (0.82)
- Technology:
- Information Technology > Artificial Intelligence
- Cognitive Science > Emotion (0.63)
- Natural Language > Large Language Model (0.93)
- Speech > Speech Recognition (0.68)