Transferring Pre-trained Multimodal Representations with Cross-modal Similarity Matching