CLIPTime: Time-Aware Multimodal Representation Learning from Images and Text