
While multimodal learning methods need modality-complete data to learn cross-modal correspondences, our method achieves more effective cross-modal fusion for unseen modality combinations. Table 4 in the main paper summarized the multimedia retrieval results with the Mean Rank (MnR) averaged between video-to-text and text-to-video retrieval. Here, we provide the full comparison with all metrics for both text-to-video and video-to-text retrieval in Table 5. Whereas recent multimodal learning methods require modality-complete data for training, our model outperforms these approaches on all metrics by effectively accumulating information from arbitrary modality combinations.

With fewer learnable tokens, the projected features become less discriminative, since many features with different semantic meanings must match the same learnable token.
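For concreteness, the sketch below shows one standard way the reported retrieval metrics (R@K and MnR) can be computed from a query-gallery similarity matrix, with the averaged MnR taken as the mean of the two retrieval directions. It is a minimal illustration, not the paper's evaluation code; it assumes a square similarity matrix whose diagonal entries correspond to the ground-truth pairs, and all names are illustrative.

```python
import numpy as np

def retrieval_metrics(sim):
    """Compute R@1/R@5/R@10 and Mean Rank (MnR) from a similarity
    matrix of shape (num_queries, num_gallery), assuming sim[i, i]
    is the score of the ground-truth pair for query i."""
    # Sort gallery items by descending similarity for each query.
    order = np.argsort(-sim, axis=1)
    gt = np.arange(sim.shape[0])[:, None]
    # 1-indexed rank of the ground-truth item for each query.
    ranks = np.argmax(order == gt, axis=1) + 1
    return {
        "R@1": np.mean(ranks <= 1) * 100,
        "R@5": np.mean(ranks <= 5) * 100,
        "R@10": np.mean(ranks <= 10) * 100,
        "MnR": ranks.mean(),
    }

# Toy scores; rows index text queries, columns index videos.
sim = np.random.randn(100, 100)
t2v = retrieval_metrics(sim)      # text-to-video
v2t = retrieval_metrics(sim.T)    # video-to-text (transposed matrix)
mnr_avg = (t2v["MnR"] + v2t["MnR"]) / 2  # averaged MnR, as in Table 4
```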