On the Causal Sufficiency and Necessity of Multi-Modal Representation Learning