abb4847bbd60f38b1b7649d26c7a0067-Supplemental-Conference.pdf

May-25-2025, 08:22:48 GMT–Neural Information Processing Systems

While multimodal learning methods need modality-complete data to learn the crossmodal correspondences, our method gives more effective cross-modal fusion for unseen modality combinations. In Table 4 in the main paper, we summarized the multimedia retrieval results with the Mean Rank (MnR) averaged between video-to-text and text-to-video. Here, we provide the full comparison with all the metrics for both text-to-video retrieval and video-to-text retrieval in Table 5. As recent multimodal learning methods need modality-complete data for training, our model outperforms these approaches on all metrics by effectively accumulating the information from any modality combinations. With fewer learnable tokens, the projected features become less discriminative since many of them with different semantic meanings need to match the same learable token.

artificial intelligence, machine learning, modality combination, (14 more...)

Neural Information Processing Systems

May-25-2025, 08:22:48 GMT

Conferences PDF

Add feedback

Technology:
- Information Technology > Artificial Intelligence > Machine Learning (1.00)