abb4847bbd60f38b1b7649d26c7a0067-Supplemental-Conference.pdf

Neural Information Processing Systems 

In Table 4 in the main paper, we summarized the multimedia retrieval results with the Mean Rank (MnR) averaged between video-to-text and text-to-video. In the main paper, we divide the tokens in half for the dual branches. Here, we test the model's performance with different partition strategies on EPIC-Kitchens and report the results in Table 6c . Default settings are shaded in gray . We further verify this claim by computing the feature distance between modalities on EPIC-Kitchens with the variants of our model used in Table 2 of the main paper.