Sequential Compositional Generalization in Multimodal Models