Seeking the Sufficiency and Necessity Causal Features in Multimodal Representation Learning