MVCL-DAF++: Enhancing Multimodal Intent Recognition via Prototype-Aware Contrastive Alignment and Coarse-to-Fine Dynamic Attention Fusion
Huang, Haofeng, Han, Yifei, Zhang, Long, Li, Bin, He, Yangfan
arXiv.org Artificial Intelligence
ABSTRACT

Multimodal intent recognition (MMIR) suffers from weak semantic grounding and poor robustness under noisy or rare-class conditions. We propose MVCL-DAF++, which extends MVCL-DAF with two key modules: (1) prototype-aware contrastive alignment, which aligns instances to class-level prototypes to enhance semantic consistency; and (2) coarse-to-fine attention fusion, which integrates global modality summaries with token-level features for hierarchical cross-modal interaction. Experiments demonstrate the effectiveness of prototype-guided learning and coarse-to-fine fusion for robust multimodal understanding.

Index Terms-- Multimodal intent recognition, Prototype-aware contrastive alignment, Coarse-to-fine dynamic attention fusion

1. INTRODUCTION

Multimodal intent recognition (MMIR) [1] aims to infer user intentions by integrating heterogeneous signals such as spoken language, facial expressions, and vocal intonation. With the rapid adoption of human-centered AI systems [2], robust and generalizable multimodal understanding has become a cornerstone of intelligent conversational agents [3, 4].
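The two modules above can be illustrated with a minimal NumPy sketch. This is a hypothetical reconstruction, not the authors' implementation: `prototype_contrastive_loss` pulls each instance embedding toward its class prototype (the normalized mean embedding of its class) via an InfoNCE-style softmax over instance-prototype similarities, and `coarse_to_fine_attention` lets a global modality summary (coarse) attend over another modality's token-level features (fine). All function names, the temperature value, and the single-head attention form are illustrative assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def prototype_contrastive_loss(embeddings, labels, temperature=0.1):
    """Sketch of prototype-aware contrastive alignment (assumed form):
    each instance is pulled toward its own class prototype and pushed
    away from other classes' prototypes, InfoNCE-style."""
    # L2-normalize instance embeddings
    z = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    classes = np.unique(labels)
    # Class prototypes: normalized mean embedding per class
    protos = np.stack([z[labels == c].mean(axis=0) for c in classes])
    protos /= np.linalg.norm(protos, axis=1, keepdims=True)
    # Temperature-scaled cosine similarities to every prototype
    sims = z @ protos.T / temperature
    # Cross-entropy against the index of each instance's own class
    log_probs = sims - np.log(np.exp(sims).sum(axis=1, keepdims=True))
    idx = np.searchsorted(classes, labels)
    return -log_probs[np.arange(len(labels)), idx].mean()

def coarse_to_fine_attention(summary, tokens):
    """Sketch of one coarse-to-fine fusion step (assumed form): a global
    modality summary queries token-level features of another modality."""
    d = summary.shape[-1]
    attn = softmax(summary @ tokens.T / np.sqrt(d))  # (1, n_tokens)
    return attn @ tokens                             # fused (1, d) vector
```

Under this reading, well-clustered embeddings yield a lower prototype loss than mismatched labels, and the fused vector is a convex combination of token features; the paper's full module presumably stacks such attention with learned projections.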
Sep-24-2025