Cross-modal Prompts: Adapting Large Pre-trained Models for Audio-Visual Downstream Tasks
Haoyi Duan, Yan Xia, Mingze Zhou, Li Tang, Jieming Zhu, Zhou Zhao
arXiv.org Artificial Intelligence
In recent years, the deployment of large-scale pre-trained models in audio-visual downstream tasks has yielded remarkable outcomes. However, these models, primarily trained on single-modality unconstrained datasets, still encounter challenges in feature extraction for multi-modal tasks, leading to suboptimal performance. This limitation arises due to the introduction of irrelevant modality-specific information during encoding, which adversely affects the performance of downstream tasks. To address this challenge, this paper proposes a novel Dual-Guided Spatial-Channel-Temporal (DG-SCT) attention mechanism. This mechanism leverages audio and visual modalities as soft prompts to dynamically adjust the parameters of pre-trained models based on the current multi-modal input features. Specifically, the DG-SCT module incorporates trainable cross-modal interaction layers into pre-trained audio-visual encoders, allowing adaptive extraction of crucial information from the current modality across spatial, channel, and temporal dimensions, while preserving the frozen parameters of large-scale pre-trained models. Experimental evaluations demonstrate that our proposed model achieves state-of-the-art results across multiple downstream tasks, including AVE, AVVP, AVS, and AVQA. Furthermore, our model exhibits promising performance in challenging few-shot and zero-shot scenarios.
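The abstract describes the core mechanism only at a high level: audio (or visual) features act as soft prompts that re-weight the other modality's frozen encoder features along spatial, channel, and temporal dimensions, while only small trainable interaction layers are updated. The PyTorch sketch below illustrates one plausible form of such an adapter; the module name, tensor shapes, and gating choices (sigmoid channel and temporal gates, softmax spatial attention) are illustrative assumptions and are not the authors' released implementation.

```python
# Minimal sketch of a DG-SCT-style cross-modal adapter. All names, shapes,
# and gating choices are assumptions for illustration, not the paper's code.
import torch
import torch.nn as nn
import torch.nn.functional as F


class CrossModalSCTAdapter(nn.Module):
    """Uses audio features as a soft prompt to re-weight visual features
    along channel, spatial, and temporal dimensions. Intended to sit between
    frozen blocks of a pre-trained visual encoder; only the adapter trains."""

    def __init__(self, vis_dim: int, aud_dim: int, hidden: int = 128):
        super().__init__()
        self.aud_proj = nn.Linear(aud_dim, hidden)
        # Channel gate: audio prompt -> per-channel weights for visual tokens.
        self.channel_gate = nn.Linear(hidden, vis_dim)
        # Spatial attention: audio prompt queries each spatial location.
        self.spatial_query = nn.Linear(hidden, vis_dim)
        # Temporal gate: audio prompt -> per-frame weight.
        self.temporal_gate = nn.Linear(hidden, 1)

    def forward(self, vis: torch.Tensor, aud: torch.Tensor) -> torch.Tensor:
        # vis: (B, T, HW, C) visual tokens per frame; aud: (B, T, aud_dim).
        prompt = F.relu(self.aud_proj(aud))                          # (B, T, hidden)

        # Channel attention: sigmoid gate over the C channels of each frame.
        ch = torch.sigmoid(self.channel_gate(prompt)).unsqueeze(2)   # (B, T, 1, C)

        # Spatial attention: similarity between audio query and each location.
        q = self.spatial_query(prompt).unsqueeze(2)                  # (B, T, 1, C)
        sp = torch.softmax((q * vis).sum(-1, keepdim=True)
                           / vis.shape[-1] ** 0.5, dim=2)            # (B, T, HW, 1)

        # Temporal attention: sigmoid gate over the T frames.
        tp = torch.sigmoid(self.temporal_gate(prompt)).unsqueeze(2)  # (B, T, 1, 1)

        # Residual connection keeps the frozen backbone's features intact.
        return vis + vis * ch * sp * tp


if __name__ == "__main__":
    adapter = CrossModalSCTAdapter(vis_dim=768, aud_dim=128)
    vis = torch.randn(2, 10, 49, 768)   # 2 clips, 10 frames, 7x7 patch tokens
    aud = torch.randn(2, 10, 128)       # matching per-frame audio features
    print(adapter(vis, aud).shape)      # torch.Size([2, 10, 49, 768])
```

In the full method a symmetric branch would presumably guide the audio encoder with visual prompts, and adapters would be inserted at multiple depths of both frozen backbones, with only the adapter parameters updated during fine-tuning, consistent with the abstract's description of keeping the large pre-trained weights frozen.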
Dec-20-2023
- Country:
  - Asia > China (0.14)
  - Middle East > Israel (0.14)
- Genre:
  - Research Report > New Finding (0.67)
- Industry:
  - Leisure & Entertainment (0.30)
- Technology:
  - Information Technology > Artificial Intelligence
    - Machine Learning > Neural Networks (0.67)
    - Natural Language > Large Language Model (0.67)
    - Vision (1.00)