Causal Graphical Models for Vision-Language Compositional Understanding