MVCL-DAF++: Enhancing Multimodal Intent Recognition via Prototype-Aware Contrastive Alignment and Coarse-to-Fine Dynamic Attention Fusion
Huang, Haofeng, Han, Yifei, Zhang, Long, Li, Bin, He, Yangfan
arXiv.org Artificial Intelligence
ABSTRACT

Multimodal intent recognition (MMIR) suffers from weak semantic grounding and poor robustness under noisy or rare-class conditions. We propose MVCL-DAF++, which extends MVCL-DAF with two key modules: (1) prototype-aware contrastive alignment, which aligns instances to class-level prototypes to enhance semantic consistency; and (2) coarse-to-fine attention fusion, which integrates global modality summaries with token-level features for hierarchical cross-modal interaction. Experiments demonstrate the effectiveness of prototype-guided learning and coarse-to-fine fusion for robust multimodal understanding.

Index Terms-- Multimodal intent recognition, Prototype-aware contrastive alignment, Coarse-to-fine dynamic attention fusion

1. INTRODUCTION

Multimodal intent recognition (MMIR) [1] aims to infer user intentions by integrating heterogeneous signals such as spoken language, facial expressions, and vocal intonation. With the rapid adoption of human-centered AI systems [2], robust and generalizable multimodal understanding has become a cornerstone of intelligent conversational agents [3, 4].
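The two modules above can be illustrated with a minimal NumPy sketch. This is a hypothetical reconstruction, not the authors' implementation: `prototype_contrastive_loss` pulls each instance embedding toward its class prototype (the normalized mean embedding of its class) via an InfoNCE-style softmax over instance-prototype similarities, and `coarse_to_fine_attention` lets a global modality summary (coarse) attend over another modality's token-level features (fine). All function names, the temperature value, and the single-head attention form are illustrative assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def prototype_contrastive_loss(embeddings, labels, temperature=0.1):
    """Sketch of prototype-aware contrastive alignment (assumed form):
    each instance is pulled toward its own class prototype and pushed
    away from other classes' prototypes, InfoNCE-style."""
    # L2-normalize instance embeddings
    z = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    classes = np.unique(labels)
    # Class prototypes: normalized mean embedding per class
    protos = np.stack([z[labels == c].mean(axis=0) for c in classes])
    protos /= np.linalg.norm(protos, axis=1, keepdims=True)
    # Temperature-scaled cosine similarities to every prototype
    sims = z @ protos.T / temperature
    # Cross-entropy against the index of each instance's own class
    log_probs = sims - np.log(np.exp(sims).sum(axis=1, keepdims=True))
    idx = np.searchsorted(classes, labels)
    return -log_probs[np.arange(len(labels)), idx].mean()

def coarse_to_fine_attention(summary, tokens):
    """Sketch of one coarse-to-fine fusion step (assumed form): a global
    modality summary queries token-level features of another modality."""
    d = summary.shape[-1]
    attn = softmax(summary @ tokens.T / np.sqrt(d))  # (1, n_tokens)
    return attn @ tokens                             # fused (1, d) vector
```

Under this reading, well-clustered embeddings yield a lower prototype loss than mismatched labels, and the fused vector is a convex combination of token features; the paper's full module presumably stacks such attention with learned projections.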
Sep-24-2025