Evaluating Compositional Scene Understanding in Multimodal Generative Models