Long-Form Video-Language Pre-Training with Multimodal Temporal Contrastive Learning

Yuchong Sun

Neural Information Processing Systems 

Large-scale video-language pre-training has yielded significant improvements on video-language understanding tasks. Previous studies of video-language pre-training have mainly focused on short-form videos (i.e., within 30 seconds) and sentences,