How Diffusion Models Learn to Factorize and Compose

Neural Information Processing Systems 

Large-scale text-to-image generative models can produce photo-realistic synthetic images that combine elements in a novel fashion (compositional generalization).