Semantic Residual for Multimodal Unified Discrete Representation
Hai Huang, Shulei Wang, Yan Xia
Recent research in the domain of multimodal unified representations predominantly employs codebooks as the representation form, using Vector Quantization (VQ) for quantization, yet other quantization schemes remain insufficiently explored. Our work investigates more precise quantization methods and introduces a new framework, Semantic Residual Cross-modal Information Disentanglement (SRCID), inspired by the numerical residual concept inherent to Residual Vector Quantization (RVQ). SRCID employs semantic residual-based information disentanglement for multimodal data to better handle the inherent discrepancies between modalities. Our method enhances the capabilities of unified multimodal representations and demonstrates exceptional performance in cross-modal generalization and cross-modal zero-shot retrieval. Its average results significantly surpass existing state-of-the-art models, as well as previous attempts based on RVQ and Finite Scalar Quantization (FSQ) applied to these modalities. Different modalities contain distinctly different information; for example, sounds present in audio may not have corresponding visual sources in video.
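To make the residual idea concrete, below is a minimal, self-contained sketch contrasting plain VQ with multi-stage RVQ, where each stage quantizes the residual left by the previous one. This illustrates the generic RVQ mechanism that inspired SRCID, not the authors' implementation; the codebook sizes, dimensions, and the improvement-only acceptance rule are assumptions made for the demo (in practice each stage's codebook is trained on the residuals).

```python
import numpy as np

rng = np.random.default_rng(0)

def vq(x, codebook):
    """Plain VQ: replace x by its nearest codebook entry (L2 distance)."""
    idx = int(np.argmin(np.linalg.norm(codebook - x, axis=1)))
    return idx, codebook[idx]

def rvq(x, codebooks):
    """RVQ: each stage quantizes the residual left by the previous stage."""
    residual = x.astype(float).copy()
    indices = []
    approx = np.zeros_like(residual)
    for cb in codebooks:
        idx, code = vq(residual, cb)
        # Real RVQ trains each stage's codebook on the residuals, so every
        # stage shrinks the error; with the random codebooks used here we
        # only accept a code when it actually improves the approximation.
        if np.linalg.norm(residual - code) < np.linalg.norm(residual):
            indices.append(idx)
            approx += code
            residual -= code  # the numerical residual passed to the next stage
    return indices, approx

# Illustrative sizes, not taken from the paper.
dim, stages, codebook_size = 8, 4, 32
codebooks = [rng.normal(size=(codebook_size, dim)) for _ in range(stages)]
x = rng.normal(size=dim)

_, x_vq = vq(x, codebooks[0])
_, x_rvq = rvq(x, codebooks)
print("VQ reconstruction error: ", np.linalg.norm(x - x_vq))
print("RVQ reconstruction error:", np.linalg.norm(x - x_rvq))  # never worse than VQ here
```

Because the first RVQ stage coincides with plain VQ and later stages only apply codes that reduce the residual, the RVQ reconstruction error in this sketch is never worse than the single-stage VQ error; SRCID transfers this stage-wise refinement idea from numerical residuals to semantic ones.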
arXiv.org Artificial Intelligence
Dec-26-2024