Understanding Transferable Representation Learning and Zero-shot Transfer in CLIP