MERLOT: MultimodalNeuralScriptKnowledgeModels
–Neural Information Processing Systems
By pretraining with a mix of both framelevel (spatial) and video-level (temporal) objectives, our model not only learns to match images to temporally corresponding words, but also to contextualize what is happening globally over time.
Neural Information Processing Systems
Feb-11-2026, 02:47:50 GMT
- Country:
- Industry:
- Technology: