MERLOT: MultimodalNeuralScriptKnowledgeModels

Neural Information Processing Systems 

By pretraining with a mix of both framelevel (spatial) and video-level (temporal) objectives, our model not only learns to match images to temporally corresponding words, but also to contextualize what is happening globally over time.