Semantic Residual for Multimodal Unified Discrete Representation
Hai Huang, Shulei Wang, Yan Xia
Recent research in the domain of multimodal unified representations predominantly employs codebooks as the representation form, using Vector Quantization (VQ) for quantization, yet other quantization schemes remain insufficiently explored. Our work investigates more precise quantization methods and introduces a new framework, Semantic Residual Cross-modal Information Disentanglement (SRCID), inspired by the numerical residual concept inherent to Residual Vector Quantization (RVQ). SRCID employs semantic residual-based information disentanglement for multimodal data to better handle the inherent discrepancies between modalities. Our method enhances the capabilities of unified multimodal representations and demonstrates exceptional performance in cross-modal generalization and cross-modal zero-shot retrieval. Its average results significantly surpass existing state-of-the-art models, as well as previous attempts based on RVQ and Finite Scalar Quantization (FSQ) applied to these modalities. Different modalities contain distinctly different information; for example, sounds present in audio may not have corresponding visual sources in video.
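To make the residual idea concrete, below is a minimal, self-contained sketch contrasting plain VQ with multi-stage RVQ, where each stage quantizes the residual left by the previous one. This illustrates the generic RVQ mechanism that inspired SRCID, not the authors' implementation; the codebook sizes, dimensions, and the improvement-only acceptance rule are assumptions made for the demo (in practice each stage's codebook is trained on the residuals).

```python
import numpy as np

rng = np.random.default_rng(0)

def vq(x, codebook):
    """Plain VQ: replace x by its nearest codebook entry (L2 distance)."""
    idx = int(np.argmin(np.linalg.norm(codebook - x, axis=1)))
    return idx, codebook[idx]

def rvq(x, codebooks):
    """RVQ: each stage quantizes the residual left by the previous stage."""
    residual = x.astype(float).copy()
    indices = []
    approx = np.zeros_like(residual)
    for cb in codebooks:
        idx, code = vq(residual, cb)
        # Real RVQ trains each stage's codebook on the residuals, so every
        # stage shrinks the error; with the random codebooks used here we
        # only accept a code when it actually improves the approximation.
        if np.linalg.norm(residual - code) < np.linalg.norm(residual):
            indices.append(idx)
            approx += code
            residual -= code  # the numerical residual passed to the next stage
    return indices, approx

# Illustrative sizes, not taken from the paper.
dim, stages, codebook_size = 8, 4, 32
codebooks = [rng.normal(size=(codebook_size, dim)) for _ in range(stages)]
x = rng.normal(size=dim)

_, x_vq = vq(x, codebooks[0])
_, x_rvq = rvq(x, codebooks)
print("VQ reconstruction error: ", np.linalg.norm(x - x_vq))
print("RVQ reconstruction error:", np.linalg.norm(x - x_rvq))  # never worse than VQ here
```

Because the first RVQ stage coincides with plain VQ and later stages only apply codes that reduce the residual, the RVQ reconstruction error in this sketch is never worse than the single-stage VQ error; SRCID transfers this stage-wise refinement idea from numerical residuals to semantic ones.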
arXiv.org Artificial Intelligence
Dec-26-2024