MVCL-DAF++: Enhancing Multimodal Intent Recognition via Prototype-Aware Contrastive Alignment and Coarse-to-Fine Dynamic Attention Fusion

Haofeng Huang, Yifei Han, Long Zhang, Bin Li, Yangfan He


ABSTRACT

Multimodal intent recognition (MMIR) suffers from weak semantic grounding and poor robustness under noisy or rare-class conditions. We propose MVCL-DAF++, which extends MVCL-DAF with two key modules: (1) prototype-aware contrastive alignment, which aligns instances to class-level prototypes to enhance semantic consistency; and (2) coarse-to-fine attention fusion, which integrates global modality summaries with token-level features for hierarchical cross-modal interaction. Experimental results demonstrate the effectiveness of prototype-guided learning and coarse-to-fine fusion for robust multimodal understanding.

Index Terms -- Multimodal intent recognition, Prototype-aware contrastive alignment, Coarse-to-fine dynamic attention fusion

1. INTRODUCTION

Multimodal intent recognition (MMIR) [1] aims to infer user intentions by integrating heterogeneous signals such as spoken language, facial expressions, and vocal intonation. With the rapid adoption of human-centered AI systems [2], robust and generalizable multimodal understanding has become a cornerstone of intelligent conversational agents [3, 4].
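To make the first module concrete, the PyTorch sketch below aligns instance embeddings to class-level prototypes with an InfoNCE-style loss. This is a minimal illustration under assumptions, not the authors' exact formulation: the temperature tau, the exponential-moving-average prototype update, and all function names here are hypothetical.

```python
import torch
import torch.nn.functional as F

def prototype_contrastive_loss(features, labels, prototypes, tau=0.07):
    # features:   (B, D) fused multimodal instance embeddings
    # labels:     (B,)   ground-truth intent class indices
    # prototypes: (C, D) one prototype vector per intent class
    feats = F.normalize(features, dim=-1)
    protos = F.normalize(prototypes, dim=-1)
    logits = feats @ protos.t() / tau  # (B, C): similarity of each instance to every prototype
    # Cross-entropy over prototype similarities pulls each instance toward
    # its own class prototype and pushes it away from the other prototypes.
    return F.cross_entropy(logits, labels)

@torch.no_grad()
def update_prototypes(prototypes, features, labels, momentum=0.9):
    # Assumed EMA update: drift each prototype toward the mean embedding of
    # its class in the current batch (prototypes could instead be learnable).
    for c in labels.unique():
        class_mean = features[labels == c].mean(dim=0)
        prototypes[c] = momentum * prototypes[c] + (1 - momentum) * class_mean
```

In training, a loss of this form would typically be added to the standard classification objective with a weighting coefficient.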
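Similarly, the second module can be sketched as a two-stage fusion: per-modality global summaries produce dynamic modality weights (coarse stage), and the weighted summary then queries the concatenated token-level features via cross-attention (fine stage). The mean-pooled summaries, the single-query design, and the dimensions below are illustrative assumptions rather than the paper's exact architecture.

```python
import torch
import torch.nn as nn

class CoarseToFineFusion(nn.Module):
    def __init__(self, dim, heads=4, n_modalities=3):
        super().__init__()
        # Coarse stage: map concatenated modality summaries to one weight per modality.
        self.gate = nn.Sequential(nn.Linear(dim * n_modalities, n_modalities),
                                  nn.Softmax(dim=-1))
        # Fine stage: token-level cross-modal attention (dim must divide by heads).
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, modalities):
        # modalities: list of (B, T_m, D) token sequences, one per modality
        summaries = [m.mean(dim=1) for m in modalities]        # coarse global summaries, (B, D) each
        weights = self.gate(torch.cat(summaries, dim=-1))      # (B, M) dynamic modality weights
        query = sum(w.unsqueeze(-1) * s
                    for w, s in zip(weights.unbind(-1), summaries)).unsqueeze(1)  # (B, 1, D)
        tokens = torch.cat(modalities, dim=1)                  # (B, sum T_m, D) token-level features
        fused, _ = self.attn(query, tokens, tokens)            # summary queries all tokens
        return fused.squeeze(1)                                # (B, D) fused representation
```

For example, with text, audio, and video token sequences of shape (B, T, 256), `CoarseToFineFusion(dim=256)` returns a single fused vector per instance, which could then feed the intent classifier and the prototype alignment loss above.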
