Multimodal Open-Vocabulary Video Classification via Pre-Trained Vision and Language Models