VideoCoCa: Video-Text Modeling with Zero-Shot Transfer from Contrastive Captioners