Understanding Hardness of Vision-Language Compositionality from A Token-level Causal Lens