A-MESS: Anchor based Multimodal Embedding with Semantic Synchronization for Multimodal Intent Recognition
Shen, Yaomin; Lin, Xiaojian; Fan, Wei
arXiv.org Artificial Intelligence
In the domain of multimodal intent recognition (MIR), the objective is to recognize human intent by integrating a variety of modalities, such as language text, body gestures, and tones. However, existing approaches struggle to capture the intrinsic connections between the modalities and overlook the corresponding semantic representations of intent. To address these limitations, we present the Anchor-based Multimodal Embedding with Semantic Synchronization (A-MESS) framework. We first design an Anchor-based Multimodal Embedding (A-ME) module that employs an anchor-based embedding fusion mechanism to integrate multimodal inputs. Furthermore, we develop a Semantic Synchronization (SS) strategy with a Triplet Contrastive Learning pipeline, which optimizes the process by synchronizing the multimodal representation with label descriptions produced by a large language model. Comprehensive experiments indicate that A-MESS achieves state-of-the-art performance and provides substantial insight into multimodal representation and downstream tasks.

In the field of natural language understanding, the multimodal intent recognition (MIR) task, which categorizes intent within a goal-driven context based on textual, visual, and auditory information, has been identified as a critical element in identifying complex human behavioral intent [1]. This matters especially in AI agent [2] applications: when a user commands an AI agent to carry out a specific task, the agent can perform the task well only if it correctly understands the intent behind the command. Compared with methods [3] that rely on a single data type, using multiple data types provides a more substantial information base, which can improve the accuracy of identifying complex intent categories.
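The abstract describes synchronizing the multimodal representation with label descriptions through triplet contrastive learning. As a rough illustration only, the sketch below shows a generic triplet margin loss over cosine distance. It is not the authors' formulation: the function names, the choice of cosine distance, the margin value, and the toy vectors are all assumptions made for this example.

```python
import math

def cosine_distance(u, v):
    """Return 1 - cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)

def triplet_loss(anchor, positive, negative, margin=0.2):
    """Generic triplet margin loss (hypothetical helper, not from the paper).

    The loss is zero once the anchor is closer to the positive than to the
    negative by at least `margin`; otherwise it grows with the violation.
    """
    d_pos = cosine_distance(anchor, positive)
    d_neg = cosine_distance(anchor, negative)
    return max(0.0, d_pos - d_neg + margin)

# Toy vectors (assumed): a fused multimodal embedding and two embeddings
# standing in for label descriptions of the correct and a wrong intent.
fused = [0.9, 0.1, 0.0]
correct_label = [1.0, 0.0, 0.0]
wrong_label = [0.0, 1.0, 0.0]

loss_good = triplet_loss(fused, correct_label, wrong_label)
loss_bad = triplet_loss(fused, wrong_label, correct_label)
print(loss_good, loss_bad)
```

In this toy setup the loss vanishes when the fused embedding already sits near the correct label description and is large when the roles are swapped, which is the pull-together, push-apart behavior a triplet objective is meant to encourage.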
Apr-1-2025