Pretrained Image-Text Models are Secretly Video Captioners