Scalable and Accurate Self-supervised Multimodal Representation Learning without Aligned Video and Text Data

Open in new window