Revisiting Temporal Modeling for CLIP-based Image-to-Video Knowledge Transferring