What Makes Multimodal Learning Better than Single (Provably)