MERLOT: MultimodalNeuralScriptKnowledgeModels

Neural Information Processing Systems 

By pretraining with a mix of both framelevel (spatial) and video-level (temporal) objectives, our model not only learns to match images to temporally corresponding words, but also to contextualize what is happening globally over time.

Similar Docs  Excel Report  more

TitleSimilaritySource
None found