Expectation-Maximization Contrastive Learning for Compact Video-and-Language Representations