Visual Representation Learning Guided By Multi-modal Prior Knowledge