Self-supervised Pre-training for Transferable Multi-modal Perception