Pretrained Image-Text Models are Secretly Video Captioners

Open in new window