Supplementary Material for "Expectation-Maximization Contrastive Learning for Compact Video-and-Language Representations"